Bayesian Sequential Detection with 
Phase-Distributed Change Time and Nonlinear 
Penalty - A POMDP Approach 

Vikram Krishnamurthy, Fellow, IEEE

Abstract

We show that the optimal decision policy for several types of Bayesian sequential detection problems 
has a threshold switching curve structure on the space of posterior distributions. This is established 
by using lattice programming and stochastic orders in a partially observed Markov decision process 
(POMDP) framework. A stochastic gradient algorithm is presented to estimate the optimal linear 
approximation to this threshold curve. We illustrate these results by first considering quickest time 
detection with phase-type distributed change time and a variance stopping penalty. Then it is proved
that the threshold switching curve also arises in several other Bayesian decision problems such as
quickest transient detection, exponential delay (risk-sensitive) penalties, stopping time problems in social 
learning, and multi-agent scheduling in a changing world. Using Blackwell dominance, it is shown that 
for dynamic decision making problems, the optimal decision policy is lower bounded by a myopic 
policy. Finally, it is shown how the achievable cost of the optimal decision policy varies with change 
time distribution by imposing a partial order on transition matrices.

I. Introduction



Quickest time change detection has applications in biomedical signal processing, machine 
monitoring and finance [38], [5]. There are two general formulations for quickest time detection. 
In the first formulation, the change point $\tau^0$ is an unknown deterministic time, and the goal is
to determine a stopping rule such that a certain worst case delay penalty is minimized subject 
to a constraint on the false alarm frequency (see, e.g., [31], [36], [53]). 

The second formulation, which is the formulation we consider in this paper, is the Bayesian 
approach. The change time $\tau^0$ is a random variable specified by a prior distribution. Consider
a sequence of discrete time random measurements $\{y_k, k \ge 1\}$, such that conditioned on the
event $\{\tau^0 = t\}$, $y_k$, $k < t$ are i.i.d. random variables with distribution $B_1$ and $y_k$, $k > t$
are i.i.d. random variables with distribution $B_2$. The quickest time detection problem involves
detecting the change time $\tau^0$ with minimal cost. That is, at each time $k = 1, 2, \ldots$, a decision
$u_k \in \{\text{continue}, \text{stop and announce change}\}$ needs to be made to optimize a tradeoff between
false alarm frequency and linear delay penalty.

In classical Bayesian quickest time detection [44], [52], [38], the change time $\tau^0$ is modelled by
a geometric distribution. A geometric distributed change time is realized by a two state discrete-time
Markov chain, which we denote as $x_k$. Therefore, in classical quickest time detection, the
optimal decision policy at each time $k$ is a function of a two-dimensional belief state (posterior
probability mass function) $\pi_k(i) = P(x_k = i \mid y_1, \ldots, y_k, u_1, \ldots, u_{k-1})$, $i = 1, 2$ with
$\pi_k(1) + \pi_k(2) = 1$. So it suffices to consider one element, say $\pi_k(2)$, of this probability mass function.
Classical quickest time change detection (see for example [38]) says that there exists a threshold
point $\pi^* \in [0, 1]$ such that the optimal decision policy is

$$u_k = \mu^*(\pi_k) = \begin{cases} \text{continue} & \text{if } \pi_k(2) \ge \pi^* \\ \text{stop and announce change} & \text{if } \pi_k(2) < \pi^* \end{cases} \qquad (1)$$

As a generalization of the Bayesian framework, [51], [52] consider dependent observations 
from a finite state Markov chain with transition probability matrix affected by the change point. 

Main Results and Organization of paper 

This paper considers Bayesian quickest time detection with the following generalizations:
phase-type distributed change times, a variance stopping penalty, optimal linear threshold policies,
and examples in transient detection, nonlinear delay penalty and stopping time problems
in social learning. Our goal is to exploit lattice programming techniques to prove the existence
of threshold optimal decision policies for a variety of quickest time detection problems. Below
is an overview of these results.

(i) Phase-type Distributed Change Times: We consider quickest time detection when the change
time $\tau^0$ has a phase-type (PH) distribution [33]. PH-distributions are used widely in modelling
discrete event systems. The optimal detection of a PH-distributed change point is useful since
the family of PH-distributions forms a dense subset of the set of all distributions, i.e., for any
given distribution function $F$ such that $F(0) = 0$, one can find a sequence of PH-distributions
$\{F_n, n \ge 1\}$ to approximate $F$ uniformly over $[0, \infty)$. As described in [33], a PH-distributed
change time can be modelled by a multi-state Markov chain with an absorbing state. (For a 2-state
Markov chain, the PH-distribution specializes to the geometric distribution.) So for quickest
time detection with PH-distributed change time, the belief states (Bayesian posterior) lie in a
multidimensional simplex of probability mass functions.

(ii) Variance Penalty: The second generalization we consider is a stopping penalty comprising
the false alarm and a variance penalty. The variance penalty is essential in stopping problems
where one is interested in ultimately estimating the state $x$. It penalizes stopping too soon if the
uncertainty of the state estimate is large.$^1$ Since the variance is quadratic in the belief state $\pi$, it
is not possible to reformulate a variance penalty problem as a standard stopping time problem.

Under what conditions does there exist an optimal threshold decision policy for quickest
detection with PH-distributed change time and variance penalty? How can the belief states (in
a multi-dimensional simplex) be ordered and compared with a threshold? Sec. II formulates the
quickest time detection problem as a partially observed Markov decision process (POMDP) and
characterizes the optimal decision policy as the solution of a stochastic dynamic programming
problem. Using lattice programming [48], our main result (Theorems 1 and 2 in Sec. III) shows
that the optimal decision policy is governed by a threshold switching curve on the space of
Bayesian distributions (belief states). This result is useful for several reasons: (a) It provides a
multi-dimensional generalization of (1) to PH-distributed change times. (b) Efficient algorithms
can be designed to estimate optimal policies that satisfy this threshold structure. (c) The result
holds under set-valued constraints on the change time and observation distribution. So there is
an inherent robustness since even if the underlying model parameters are not exactly specified,
the threshold structure still holds.

1 In [3], a continuous time stochastic control problem is formulated with a quadratic stopping cost, and the existence of the
solution to the resulting quasi-variational inequality is proved. However, [3] does not deal with structural results.

June 14, 2011 DRAFT

Going from a 2 state Markov chain (geometric distributed change time) to multiple states 
(PH-distributed change time) introduces substantial complications. For 2 state Markov chains, 
the posterior distribution can be parametrized by a scalar (as in (1)) and therefore can be 
completely ordered. However, for more than 2 states, comparing posterior distributions requires 
using stochastic orders which are partial orders. In this paper we use the monotone likelihood 
ratio (MLR) stochastic order [41], [32], [22], [24] to prove our structural results. The MLR 
order is ideally suited for Bayesian problems since it is preserved under conditional expectations. 
However, determining the optimal policy is non-trivial since the policy can only be characterized 
on a partially ordered set (more generally a lattice) within the unit simplex. We modify the MLR 
stochastic order to operate on line segments within the unit simplex of posterior distributions. 
Such line segments form chains (totally ordered subsets of a partially ordered set) and permit 
us to prove that the optimal decision policy has a threshold structure. Theorem 3 shows that for 
linear delay and false alarm penalties, the stopping region is convex. 
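
The contrast between the totally ordered two-state case and the partially ordered multi-state case can be made concrete with a small sketch. It assumes the standard definition of the MLR order, $\pi_1 \ge_r \pi_2$ if $\pi_1(i)\pi_2(j) \le \pi_2(i)\pi_1(j)$ for all $i < j$ (the paper's own definition is in its Appendix A, which is not reproduced here), and the standard tail-sum definition of first order stochastic dominance. The function names and the numerical beliefs are hypothetical.

```python
from itertools import combinations
import numpy as np

TOL = 1e-12  # slack for floating-point comparisons

def mlr_geq(p, q):
    """True if p >=_r q, i.e. p(i) q(j) <= q(i) p(j) for all i < j (assumed definition)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return all(p[i] * q[j] <= q[i] * p[j] + TOL
               for i, j in combinations(range(len(p)), 2))

def first_order_geq(p, q):
    """True if p >=_s q, i.e. every tail sum of p is at least that of q."""
    tails_p = np.cumsum(np.asarray(p, dtype=float)[::-1])
    tails_q = np.cumsum(np.asarray(q, dtype=float)[::-1])
    return bool(np.all(tails_p >= tails_q - TOL))

# Two beliefs in the three-state simplex that are MLR comparable.
a, b = [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]
comparable = mlr_geq(a, b)

# Two beliefs in the same simplex that are not comparable in either direction.
c, d = [0.2, 0.6, 0.2], [0.4, 0.2, 0.4]
incomparable = (not mlr_geq(c, d)) and (not mlr_geq(d, c))

# In the two-state simplex every pair is comparable (scalar parametrization).
two_state = mlr_geq([0.3, 0.7], [0.6, 0.4])
```

The pair `c`, `d` illustrates why a single scalar threshold cannot be applied directly when the chain has more than two states.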

(iii) Optimal Linear Threshold: Having established the existence of a threshold curve, Theorem 
4 gives necessary and sufficient conditions for the optimal linear hyperplane approximation to this 
curve. Then a simulation-based stochastic approximation algorithm (Algorithm 1) is presented 
to compute this optimal linear hyperplane approximation. 

The remainder of the paper illustrates the above structural results in several examples. 

(iv) Example 2: Quickest Transient Detection: Sec. IV considers quickest transient detection. 
We refer the reader to [39] for a nice description of the quickest transient detection problem 
and various cost functions. In quickest transient detection, a Markov chain state jumps from a 
starting state to a transient state at a geometric distributed time, and then jumps out of the state 
to an absorbing state at another geometric distributed time. We show in Theorem 5 that a similar 
structural result to quickest time detection holds (i.e., existence of a threshold switching curve 
and convexity of the stopping region). 

(v) Example 3: Quickest Time Detection with Exponential Delay Penalty: In Sec. V, we generalize
the results of Poor [36] to PH-distributed change times. Poor [36] considers a novel
variation of the quickest time detection problem where the time delay in the detection is penalized
exponentially. By converting the resulting problem into a standard stopping problem [12], it is
shown in [36] that the optimal decision policy is a threshold policy under mild conditions.

The exponential delay penalty cost function in [36] is a special case of risk sensitive stochastic 
control with geometric change times. Assuming more general PH-distributed change times, 
Theorem 6 shows that the optimal detection policy is characterized by a threshold switching 
curve and the stopping region is convex in the risk-sensitive belief state. 

Risk sensitive stochastic control is widely used in mathematical finance, see [6], [17], [14]
for comprehensive treatments in discrete and continuous time. In simple terms, quickest time
detection seeks to optimize the objective $E\{J^0\}$ where $J^0$ is the accumulated sample path cost
until some stopping time $\tau$. In risk sensitive control, one seeks to optimize $J = E\{\exp(\epsilon J^0)\}$.
Note that $\epsilon J$ can be written as $\epsilon J = \epsilon + \epsilon^2 E\{J^0\} + \text{higher order terms}$. It therefore follows
that for $\epsilon > 0$, the scaled cost $\epsilon J$ and hence $J$ is robust and heavily penalizes large sample path
costs due to the presence of second order moments. This is termed a risk-averse control and
is of significant importance in mathematical finance, see [6]. Risk sensitive control provides a
nice formalization of the exponential penalty delay cost and allows us to generalize the results
in Poor [36] to phase-distributed change times by applying lattice programming.
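
The expansion quoted in the paragraph above is the Taylor series of the exponential. A short worked version, assuming the moments of $J^0$ exist and $\epsilon$ is small, makes the "higher order terms" explicit:

```latex
\begin{align*}
J &= E\{\exp(\epsilon J^0)\}
   = 1 + \epsilon\, E\{J^0\} + \frac{\epsilon^2}{2}\, E\{(J^0)^2\} + O(\epsilon^3), \\
\epsilon J &= \epsilon + \epsilon^2\, E\{J^0\} + \frac{\epsilon^3}{2}\, E\{(J^0)^2\} + O(\epsilon^4).
\end{align*}
```

The first of the higher order terms carries the second moment of $J^0$, which is where the sensitivity to large sample path costs enters.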

(vi) Example 4 and 5: Stopping Time Problems in Multi-agent Social Learning: Sec.VI presents 
two examples of stopping time problems involving social learning amongst multiple agents. We 
consider: How do local decisions in social learning affect the global decision in a stopping 
time problem? Social learning has been used in economics [2], [8], [11], for example to model 
behavior in financial markets; see also [27], [46]. In social learning, each agent optimizes its 
local utility selfishly and then broadcasts its action. Subsequent agents then use their private 
observation together with the actions of previous agents to learn an underlying state. 

Our first result (Example 4) deals with a multi-agent Bayesian stopping time problem where 
agents perform greedy social learning and reveal their local actions to subsequent agents. How can 
the multi-agent system make a global decision when to stop? Such problems arise in automated 
decision systems (e.g., sensor networks) where agents make local decisions and reveal these 
local decisions to subsequent agents. Theorem 8 shows that the optimal decision policy of the 
stopping time problem has multiple thresholds. This is unusual: if it is optimal to declare state 1
based on a Bayesian belief, it may not be optimal to declare state 1 when the belief about state
1 is stronger. We also give an explicit example of an optimal double threshold policy. The result
shows that making global decisions based on local decisions involves non-monotone policies.

Our second result (Example 5) deals with "constrained optimal" social learning. A key result 
in social learning is that rational agents eventually herd, that is, they pick the same action 
irrespective of their private observation and social learning stops. To enhance social learning, 
Chamley [11] (see also [45] for related work) formulated constrained social learning as a stopping 
time problem where agents either reveal their observations or they herd (which is equivalent to 
stopping in a sequential decision problem). When should a multi-agent system make the global 
decision to stop (herd)? Intuitively, the decision to stop should be made when the state estimate 
is sufficiently accurate so that revealing private observations is no longer required. Theorem 9 
in Sec. VI shows that the constrained optimal social learning proposed by Chamley [11] has a 
threshold switching curve in the space of public belief states. Thus the global decision to stop 
in [11] can be implemented efficiently in a multi-agent system. 

(vii) Example 6: Multi-agent Scheduling in a Changing World: In Sec. VII we examine: 
How can the optimal decision policy be bounded in terms of a myopic policy? How does 
the achievable cost of the optimal policy vary with transition probabilities (and therefore change 
time distribution)? We answer these two questions in a general setting where optimal decisions 
need to be made when the underlying state x evolves according to a finite state Markov chain 
without necessarily having an absorbing state. The problem is no longer a stopping problem; it 
is a more general partially-observed stochastic control problem. 

To formulate these results, Sec. VII considers a multi-agent scheduling problem. Using Black- 
well dominance, Theorem 10 shows that the optimal policy is lower bounded by a myopic 
policy. The myopic policy can be computed efficiently and is a rigorous lower bound to the 
computationally intractable optimal policy. Finally, Theorem 11 examines how the optimal 
expected cost varies with transition matrix for a stopping time problem (e.g., quickest detection 
problem) and more general dynamic decision problem in a changing world. The theorem shows 
that for the underlying Markovian state, the larger the transition matrix (according to an order
defined in Sec. VII), the cheaper the optimal expected cost.

II. Partially Observed Stochastic Control Formulation

In this section we present a partially observed stochastic control formulation that allows us to 
tackle the various stopping time problems considered in subsequent sections. 

A. Stopping-Time Stochastic Control Model 

The model comprises the following ingredients:

1. Absorbing-state Markov chain and Phase-Type Distribution Change Time: We model the
change point $\tau^0$ by a phase type (PH) distribution. The family of all PH-distributions forms a
dense subset for the set of all distributions [33] and hence can be used to approximate change
points with an arbitrary distribution. This is done by constructing a multi-state Markov chain
as follows: Let $k = 0, 1, \ldots$ denote discrete time. Assume the state of nature $x_k$ evolves as a
Markov chain on the finite state space

$$\{e_1, \ldots, e_X\}, \text{ where } e_i \text{ is the } X\text{-dimensional unit vector with 1 in the } i\text{-th position.} \qquad (2)$$

Here state '1' (corresponding to $e_1$) is an absorbing state and denotes the state after the jump
change. The states $2, \ldots, X$ (corresponding to $e_2, \ldots, e_X$) can be viewed as a single composite
state that $x$ resides in before the jump. Denote

$$\mathbb{X} = \{1, 2, \ldots, X\}. \qquad (3)$$

We assume that the change occurs after at least one measurement. So the initial distribution

$$\pi_0 = (\pi_0(i), i \in \mathbb{X}), \quad \pi_0(i) = P(x_0 = e_i) \quad \text{satisfies } \pi_0(1) = 0. \qquad (4)$$

The $X \times X$ transition probability matrix $P$ with elements $P_{ij} = P(x_{k+1} = e_j \mid x_k = e_i)$ is

$$P = \begin{bmatrix} 1 & 0 \\ \underline{P}_{(X-1) \times 1} & \bar{P}_{(X-1) \times (X-1)} \end{bmatrix}. \qquad (5)$$

Let the "change time" $\tau^0$ denote the time at which $x_k$ enters the absorbing state 1, i.e.,

$$\tau^0 = \inf\{k : x_k = 1\}. \qquad (6)$$

The distribution of $\tau^0$ is determined by choosing the transition probabilities $\underline{P}$, $\bar{P}$ in (5). To
ensure that $\tau^0$ is finite, assume states $2, 3, \ldots, X$ are transient. This is equivalent to $\bar{P}$ satisfying
$\sum_{n=1}^{\infty} \bar{P}^n_{ii} < \infty$ for $i = 1, \ldots, X - 1$ (where $\bar{P}^n_{ii}$ denotes the $(i, i)$ element of the $n$-th power of
matrix $\bar{P}$). The distribution of $\tau^0$ (which is equivalent to the distribution of the absorption time
to state 1) is given by

$$\nu_0 = \pi_0(1), \quad \nu_k = \bar{\pi}_0' \bar{P}^{k-1} \underline{P}, \quad k \ge 1, \qquad (7)$$

where $\bar{\pi}_0 = [\pi_0(2), \ldots, \pi_0(X)]'$. The key idea is that by appropriately choosing the pair $(\pi_0, P)$
and the associated state space dimension $X$, one can approximate any given discrete distribution
on $[0, \infty)$ by the distribution $\{\nu_k, k \ge 0\}$; see [33, pp. 240-243]. The event $\{x_k = 1\}$ means the
change point has occurred at time $k$ according to PH-distribution (7). Of course, in the special
case when $x$ is a 2-state Markov chain, the change time $\tau^0$ is geometrically distributed.
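
The absorption-time distribution (7) can be evaluated with a few matrix-vector products. The sketch below is illustrative only: the function name and the two example chains are hypothetical, and state 1 of the paper is stored at array index 0.

```python
import numpy as np

def ph_change_time_pmf(pi0, P, k_max):
    """Return [nu_0, ..., nu_k_max], the pmf (7) of the absorption time into state 1.

    pi0 : length-X initial distribution (index 0 is the absorbing state 1).
    P   : X x X transition matrix whose first row is [1, 0, ..., 0].
    """
    pi0 = np.asarray(pi0, dtype=float)
    P = np.asarray(P, dtype=float)
    P_under = P[1:, 0]     # transient -> absorbing probabilities
    P_bar = P[1:, 1:]      # transitions among the transient states
    nu = np.empty(k_max + 1)
    nu[0] = pi0[0]
    row = pi0[1:].copy()   # holds pi0_bar' P_bar^(k-1)
    for k in range(1, k_max + 1):
        nu[k] = row @ P_under
        row = row @ P_bar
    return nu

# Two-state chain: absorption probability 0.1 per step gives a geometric pmf.
P2 = np.array([[1.0, 0.0],
               [0.1, 0.9]])
nu2 = ph_change_time_pmf([0.0, 1.0], P2, 50)

# Three-state chain (hypothetical numbers): state 3 must pass through state 2.
P3 = np.array([[1.0, 0.0, 0.0],
               [0.3, 0.7, 0.0],
               [0.0, 0.4, 0.6]])
nu3 = ph_change_time_pmf([0.0, 0.0, 1.0], P3, 400)
```

In the three-state example the change cannot occur at time 1, which a geometric change time cannot represent; adding transient states enlarges the family of change-time distributions in this way.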

2. Observation: At time $k$, the noisy observation $y_k \in \mathbb{Y}$ given state $x_k$ has conditional
probability distribution

$$P(y_k \le \bar{y} \mid x_k = e_i) = \sum_{y \le \bar{y}} B_{iy}, \quad i \in \mathbb{X}. \qquad (8)$$

Here $\sum_y$ denotes integration with respect to the Lebesgue measure (in which case $\mathbb{Y} \subset \mathbb{R}$ and $B_{iy}$
is the conditional probability density function) or counting measure (in which case $\mathbb{Y}$ is a subset
of the integers and $B_{iy}$ is the conditional probability mass function $B_{iy} = P(y_k = y \mid x_k = e_i)$).

3. Belief State: At time $k$, the belief state is the posterior probability mass function of $x_k$
given the observation history $y_1, \ldots, y_k$ and past decisions $u_1, \ldots, u_{k-1}$. That is,

$$\pi_k = (\pi_k(i), i \in \mathbb{X}), \quad \pi_k(i) = P(x_k = e_i \mid y_1, \ldots, y_k, u_1, \ldots, u_{k-1}), \quad \text{initialized by } \pi_0. \qquad (9)$$

Equivalently, denote the filtration

$$\mathcal{F}_k = \sigma\text{-algebra generated by } (y_1, \ldots, y_k, u_1, \ldots, u_{k-1}). \qquad (10)$$

Then $\pi_k = E\{x_k \mid \mathcal{F}_k\}$. (The notational advantage of choosing unit vectors (2) for the state space
is that conditional probabilities and conditional expectations coincide.)

The belief state is updated via the Bayesian (Hidden Markov Model) filter

$$\pi_k = T(\pi_{k-1}, y_k), \quad \text{where } T(\pi, y) = \frac{B_y P' \pi}{\sigma(\pi, y)}, \quad \sigma(\pi, y) = 1_X' B_y P' \pi, \qquad (11)$$

$$B_y = \text{diag}(P(y \mid x = e_i), i \in \mathbb{X}).$$

Here $1_X$ denotes the $X$ dimensional vector of ones. The belief state $\pi$ in (11) is an $X$-dimensional
probability vector. It belongs to the $X - 1$ dimensional unit-simplex denoted as

$$\Pi(X) = \{\pi \in \mathbb{R}^X : 1_X' \pi = 1, \; 0 \le \pi(i) \le 1 \text{ for all } i \in \mathbb{X}\}. \qquad (12)$$

For example, $\Pi(2)$ is a one dimensional simplex (unit line segment), $\Pi(3)$ is a two-dimensional
simplex (equilateral triangle); $\Pi(4)$ is a tetrahedron, etc. The states $e_1, e_2, \ldots, e_X$ of the Markov
chain $x$ are the vertices of $\Pi(X)$.
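
A minimal sketch of one step of the filter (11) for a finite observation alphabet follows. The function name, the three-state transition matrix and the observation matrix are hypothetical; states 2 and 3 are given identical observation rows so that only the change itself is observable.

```python
import numpy as np

def hmm_filter_step(pi, P, B_y):
    """One step of the filter (11).

    pi  : current belief, length-X probability vector.
    P   : X x X transition matrix.
    B_y : length-X vector of likelihoods P(y | x = e_i) for the new observation y.
    Returns (T(pi, y), sigma(pi, y)).
    """
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    unnormalized = np.asarray(B_y, dtype=float) * (P.T @ pi)   # B_y P' pi
    sigma = unnormalized.sum()                                 # 1' B_y P' pi
    if sigma <= 0.0:
        raise ValueError("observation has zero likelihood under the current belief")
    return unnormalized / sigma, sigma

# Hypothetical three-state model with a binary observation alphabet.
P = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0],
              [0.0, 0.4, 0.6]])
B = np.array([[0.8, 0.2],    # row i holds P(y | x = e_i)
              [0.3, 0.7],
              [0.3, 0.7]])
pi0 = np.array([0.0, 0.5, 0.5])

pi1, sigma1 = hmm_filter_step(pi0, P, B[:, 0])
sigma_total = sum(hmm_filter_step(pi0, P, B[:, y])[1] for y in range(B.shape[1]))
```

The normalizer $\sigma(\pi, y)$ is the predicted probability of observing $y$, so summing it over the alphabet gives one; the same quantity weights the value function in the dynamic programming recursion later in this section.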

4. Sequential Decision and Costs: At each time $k$, a decision $u_k$ is taken where

$$u_k = \mu(\pi_k) \in \mathcal{U} = \{1 \text{ (announce change and stop)}, \; 2 \text{ (continue)}\}. \qquad (13)$$

In (13), the policy $\mu$ belongs to the class of stationary decision policies denoted $\boldsymbol{\mu}$.

(i) Cost of announcing change and stopping: If decision $u_k = 1$ is chosen, then the problem
terminates. If $u_k = 1$ is chosen before the change point $\tau^0$, then a false alarm and variance
penalty is paid. If $u_k = 1$ is chosen at or after the change point $\tau^0$, then only a variance penalty
is paid. Below we formulate these costs.

Let $g = (g_1, \ldots, g_X)'$ specify the physical state levels associated with states $1, 2, \ldots, X$ of
the Markov chain $x$. The variance penalty is

$$E\{\|(x_k - \pi_k)' g\|^2 \mid \mathcal{F}_k\} = G'\pi_k - (g'\pi_k)^2, \quad \text{where } G_i = g_i^2 \text{ and } G = (G_1, G_2, \ldots, G_X)'. \qquad (14)$$

This conditional variance penalizes choosing the stop action if the uncertainty in the state estimate
is large, see also [3].

Next, the false alarm event $\cup_{i \ge 2} \{x_k = e_i\} \cap \{u_k = 1\} = \{x_k \ne e_1\} \cap \{u_k = 1\}$ represents the
event that a change is announced before the change happens at time $\tau^0$. To evaluate the false
alarm penalty, let $f_i I(x_k = e_i, u_k = 1)$ denote the cost of a false alarm in state $e_i$, $i \in \mathbb{X}$, where
$f_i \ge 0$. Of course, $f_1 = 0$ since a false alarm is only incurred if the stop action is picked in
states $2, \ldots, X$. The expected false alarm penalty is

$$\sum_{i \in \mathbb{X}} f_i E\{I(x_k = e_i, u_k = 1) \mid \mathcal{F}_k\} = f'\pi_k I(u_k = 1), \quad \text{where } f = (f_1, \ldots, f_X)', \; f_1 = 0. \qquad (15)$$

The false alarm vector $f$ is chosen with increasing elements so that states further from state 1
incur larger penalties.

Then with $\alpha$, $\beta$ denoting non-negative constants that weight the relative importance of these
costs, the expected stopping cost at time $k$ is

$$\bar{C}(\pi_k, u_k = 1) = \alpha (G'\pi_k - (g'\pi_k)^2) + \beta f'\pi_k. \qquad (16)$$

One can also view $\alpha$ informally as a Lagrange multiplier in a stopping time problem that seeks
to minimize a cumulative cost (as in (20) below) subject to a variance stopping constraint.

(ii) Delay cost of continuing: We allow two possible choices for the delay costs:

(a) If decision $u_k = 2$ is taken then $\{x_{k+1} = e_1, u_k = 2\}$ is the event that no change is declared
at time $k$ even though the state has changed at time $k + 1$. So with $d$ denoting a non-negative
constant, $d\, I(x_{k+1} = e_1, u_k = 2)$ depicts a delay cost. The expected delay cost for decision
$u_k = 2$ is

$$\bar{C}(\pi_k, u_k = 2) = d\, E\{I(x_{k+1} = e_1, u_k = 2) \mid \mathcal{F}_k\} = d\, e_1' P' \pi_k. \qquad (17)$$

The above cost is motivated by applications (e.g., sensor networks) where if the decision maker
chooses $u_k = 2$, then it needs to gather observation $y_{k+1}$ thereby incurring an additional operational
cost denoted as $c$. Strictly speaking, $\bar{C}(\pi, 2) = d\, e_1' P' \pi + c$. Without loss of generality set the
constant $c$ to zero, as it does not affect our structural results. The penalty $d\, I(x_{k+1} = e_1, u_k = 2)$
gives incentive for the decision maker to predict the state $x_{k+1}$.

(b) Instead of the above, the more 'classical' formulation is that a delay cost is incurred when
the event $\{x_k = e_1, u_k = 2\}$ occurs. Then the expected delay cost is

$$\bar{C}(\pi_k, u_k = 2) = d\, E\{I(x_k = e_1, u_k = 2) \mid \mathcal{F}_k\} = d\, e_1' \pi_k. \qquad (18)$$

Remark: Due to the variance penalty, the cost $\bar{C}(\pi, 1)$ in (16) is quadratic in the belief state $\pi$.
Therefore, the formulation cannot be reduced to a standard stopping problem with linear costs
in the belief state.
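
The stopping cost (16) and the two delay costs (17), (18) are simple functions of the belief. The following sketch evaluates them for one belief; the function names and all numerical values are hypothetical, and state 1 is stored at array index 0.

```python
import numpy as np

def stopping_cost(pi, g, f, alpha, beta):
    """Stopping cost (16): alpha * (conditional variance of g'x) + beta * f'pi."""
    pi, g, f = (np.asarray(v, dtype=float) for v in (pi, g, f))
    variance = (g ** 2) @ pi - (g @ pi) ** 2      # G'pi - (g'pi)^2 as in (14)
    return alpha * variance + beta * (f @ pi)

def delay_cost_predictive(pi, P, d):
    """Delay cost (17): d * P(x_{k+1} = e_1 | F_k) = d * e_1' P' pi."""
    return d * (np.asarray(P, dtype=float).T @ np.asarray(pi, dtype=float))[0]

def delay_cost_classical(pi, d):
    """Delay cost (18): d * P(x_k = e_1 | F_k) = d * e_1' pi."""
    return d * float(np.asarray(pi, dtype=float)[0])

# Hypothetical parameters for a three-state chain.
P = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0],
              [0.0, 0.4, 0.6]])
g = np.array([0.0, 1.0, 1.0])
f = np.array([0.0, 1.0, 1.0])
pi = np.array([0.2, 0.5, 0.3])

c_stop = stopping_cost(pi, g, f, alpha=1.0, beta=2.0)
c_delay_a = delay_cost_predictive(pi, P, d=1.5)
c_delay_b = delay_cost_classical(pi, d=1.5)
```

Only `stopping_cost` is nonlinear in `pi`; both delay costs are linear, which is the distinction the Remark above draws.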

B. Quickest Time Detection Objective 

Let $(\Omega, \mathcal{F})$ be the underlying measurable space where $\Omega = (\mathbb{X} \times \mathcal{U} \times \mathbb{Y})^\infty$ is the product space,
which is endowed with the product topology and $\mathcal{F}$ is the corresponding product sigma-algebra.
For any $\pi_0 \in \Pi(X)$, and policy $\mu \in \boldsymbol{\mu}$, there exists a (unique) probability measure $P^\mu_{\pi_0}$ on
$(\Omega, \mathcal{F})$, see [16] for details. Let $E^\mu_{\pi_0}\{\cdot\}$ denote the expectation with respect to the measure $P^\mu_{\pi_0}$.$^2$

Let $\tau$ denote a stopping time adapted to the sequence of $\sigma$-algebras $\mathcal{F}_k$, $k \ge 1$, defined in
(10). That is, with $u_k$ determined by decision policy (13),

$$\tau = \inf\{k : u_k = 1\}. \qquad (19)$$

2 The formulation on $\Omega = (\mathbb{X} \times \mathcal{U} \times \mathbb{Y})^\infty$ is as follows, see [16]. Augment $\Pi(X)$ to include the fictitious stopping state $e_{X+1}$
which is cost free, i.e., $\bar{C}(e_{X+1}, u) = 0$ for all $u \in \mathcal{U}$. When decision $u_k = 1$ is chosen, the belief state $\pi_{k+1}$ transitions to
$e_{X+1}$ and remains there indefinitely. Then (20) is equivalent to $J_\mu(\pi_0) = E^\mu_{\pi_0}\{\sum_{k=1}^{\tau-1} \rho^{k-1} \bar{C}(\pi_k, u_k = 2) + \rho^{\tau-1} \bar{C}(\pi_\tau, u_\tau = 1) + \sum_{k=\tau+1}^{\infty} \rho^{k-1} \bar{C}(e_{X+1}, u_k)\}$, where the last summation is zero.

For each initial distribution $\pi_0 \in \Pi(X)$, and policy $\mu$, the following cost is associated:

$$J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{\sum_{k=1}^{\tau-1} \rho^{k-1} \bar{C}(\pi_k, u_k = 2) + \rho^{\tau-1} \bar{C}(\pi_\tau, u_\tau = 1)\Big\}. \qquad (20)$$

Here $\rho \in [0, 1]$ denotes an economic discount factor. Since $\bar{C}(\pi, 1)$, $\bar{C}(\pi, 2)$ are non-negative and
bounded for all $\pi \in \Pi(X)$, stopping is guaranteed in finite time, i.e., $\tau$ is finite with probability
1 for any $\rho \in [0, 1]$ (including $\rho = 1$).
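
For a fixed policy, the cost (20) can be estimated by simulating sample paths of the state, the observations and the filter (11). The sketch below is a plain Monte Carlo estimate under stated assumptions: a finite observation alphabet, a two-state chain, linear costs ($\alpha = 0$, $f = e_2$), delay cost (17), and a hypothetical, non-optimized threshold on the pre-change probability. The helper names and the `max_steps` truncation are also assumptions of this sketch.

```python
import numpy as np

def simulate_cost(P, B, pi0, policy, stop_cost, cont_cost, rho, rng, max_steps=10_000):
    """One sample-path realization of the quantity inside the expectation in (20)."""
    n_states, n_obs = B.shape
    pi = np.asarray(pi0, dtype=float)
    x = rng.choice(n_states, p=pi)           # x_0 ~ pi_0
    total = 0.0
    for k in range(1, max_steps + 1):
        x = rng.choice(n_states, p=P[x])     # state transition
        y = rng.choice(n_obs, p=B[x])        # observation of the new state
        unnorm = B[:, y] * (P.T @ pi)        # filter (11)
        pi = unnorm / unnorm.sum()
        discount = rho ** (k - 1)
        if policy(pi) == 1:                  # announce change and stop
            return total + discount * stop_cost(pi)
        total += discount * cont_cost(pi)
    return total                             # truncated path, no stopping cost added

# Hypothetical two-state model; index 0 is the absorbing post-change state.
P = np.array([[1.0, 0.0],
              [0.1, 0.9]])
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])
pi0 = np.array([0.0, 1.0])
beta, d, rho = 1.0, 0.5, 0.95
threshold = 0.2                              # hypothetical, not an optimized value

policy = lambda pi: 1 if pi[1] < threshold else 2
stop_cost = lambda pi: beta * pi[1]          # (16) with alpha = 0 and f = e_2
cont_cost = lambda pi: d * (P.T @ pi)[0]     # (17)

rng = np.random.default_rng(0)
costs = np.array([simulate_cost(P, B, pi0, policy, stop_cost, cont_cost, rho, rng)
                  for _ in range(500)])
estimate = costs.mean()
```

Repeating the estimate for several values of `threshold` gives a crude way to compare threshold policies; it is not the stochastic gradient algorithm announced for Sec. III-E.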

Remark: For the special case $\alpha = 0$, $\mathbb{X} = \{1, 2\}$ (i.e., geometric distributed change time), $f = e_2$,
and delay cost (17), it is easily shown that

$$J_\mu(\pi_0) = d\, E^\mu_{\pi_0}\{(\tau - \tau^0)^+\} + \beta P^\mu_{\pi_0}(\tau < \tau^0) + d P_{21} E^\mu_{\pi_0}\{(\tau^0 - 1) I(\tau^0 < \tau)\} \qquad (21)$$

(where $\tau^0$ is defined in (6) and $\tau$ is defined in (19)). For the delay cost (18), the cost function
assumes the classical Kolmogorov-Shiryayev criterion for detection of disorder [43], namely

$$J_\mu(\pi_0) = d\, E^\mu_{\pi_0}\{(\tau - \tau^0)^+\} + \beta P^\mu_{\pi_0}(\tau < \tau^0). \qquad (22)$$

■ 

The goal is to determine the change time $\tau^0$ defined in (6) with minimal cost, that is, compute
the optimal policy $\mu^* \in \boldsymbol{\mu}$ to minimize (20), i.e., $J_{\mu^*}(\pi_0) = \inf_{\mu \in \boldsymbol{\mu}} J_\mu(\pi_0)$. The existence of an
optimal stationary policy $\mu^*$ follows from [7, Prop. 1.3, Chapter 3]. Considering the above cost
(20), the optimal stationary policy $\mu^* : \Pi(X) \to \{1, 2\}$ and associated value function $\bar{V}(\pi)$ are
the solution of the following "Bellman's dynamic programming equation"

$$\mu^*(\pi) = \arg\min\Big\{\bar{C}(\pi, 1), \; \bar{C}(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} \bar{V}(T(\pi, y)) \sigma(\pi, y)\Big\}, \qquad (23)$$

$$J_{\mu^*}(\pi) = \bar{V}(\pi) = \min\Big\{\bar{C}(\pi, 1), \; \bar{C}(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} \bar{V}(T(\pi, y)) \sigma(\pi, y)\Big\}.$$

Before proceeding, we rewrite the above in a form that is more amenable$^3$ for analysis. Define

$$V(\pi) = \bar{V}(\pi) - (\alpha + \beta) f'\pi, \quad C(\pi, 1) = \alpha (G'\pi - (g'\pi)^2) - \alpha f'\pi,$$

$$C(\pi, 2) = \bar{C}(\pi, 2) - (\alpha + \beta) f'\pi + \rho (\alpha + \beta) f' P' \pi. \qquad (24)$$

3 The reason for changing coordinates from $\bar{C}(\pi, 1)$, $\bar{C}(\pi, 2)$ to $C(\pi, 1)$, $C(\pi, 2)$ is to make our analysis compatible with
existing results in quickest time detection. To ensure this compatibility, we need $C(\pi, 1)$ to be decreasing with respect to the
MLR order (see Appendix) when $\alpha = 0$. As shown in the proof of Theorem 1 in the Appendix, $C(\pi, 1)$ is MLR decreasing.
In comparison $\bar{C}(\pi, 1)$ is not MLR decreasing. Of course, the stopping set $\mathcal{R}_1$ (see (26)) and optimal policy $\mu^*$ are invariant to
the choice of coordinates. This idea of changing coordinates is described in [13], albeit for the simpler fully observed Markov
decision process case.

Then clearly $V(\pi)$ satisfies Bellman's dynamic programming equation

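$$\mu^*(\pi) = \arg\min_{u \in \mathcal{U}} Q(\pi, u), \quad J_{\mu^*}(\pi) = V(\pi) = \min_{u \in \{1, 2\}} Q(\pi, u), \qquad (25)$$

$$\text{where } Q(\pi, 2) = C(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} V(T(\pi, y)) \sigma(\pi, y), \quad Q(\pi, 1) = C(\pi, 1).$$

Thus the goal is to determine the optimal stopping set

$$\mathcal{R}_1 = \{\pi \in \Pi(X) : \mu^*(\pi) = 1\} = \Big\{\pi \in \Pi(X) : C(\pi, 1) < C(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} V(T(\pi, y)) \sigma(\pi, y)\Big\}$$

$$= \Big\{\pi \in \Pi(X) : \bar{C}(\pi, 1) < \bar{C}(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} \bar{V}(T(\pi, y)) \sigma(\pi, y)\Big\}. \qquad (26)$$

In Sec.III-B, sufficient conditions are given to ensure that $\mathcal{R}_1$ is non-empty.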
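
The equality of the two descriptions of the stopping set in (26) can be checked by direct substitution. In the sketch below, $\bar{C}$, $\bar{V}$ denote the costs and value function of (16)-(18) and (23), and $C$, $V$ the transformed quantities of (24); the only extra assumption is that the observation likelihoods sum (or integrate) to one, so that $\sum_y B_y$ is the identity matrix. It also uses $T(\pi, y)\,\sigma(\pi, y) = B_y P'\pi$ from (11).

```latex
\begin{align*}
\sum_{y} \bar{V}(T(\pi,y))\,\sigma(\pi,y)
  &= \sum_{y} V(T(\pi,y))\,\sigma(\pi,y) + (\alpha+\beta)\, f' \sum_{y} B_y P'\pi \\
  &= \sum_{y} V(T(\pi,y))\,\sigma(\pi,y) + (\alpha+\beta)\, f' P'\pi, \\[4pt]
\bar{C}(\pi,1) - (\alpha+\beta) f'\pi
  &= \alpha\big(G'\pi - (g'\pi)^2\big) - \alpha f'\pi = C(\pi,1), \\[4pt]
\bar{C}(\pi,2) + \rho \sum_{y} \bar{V}(T(\pi,y))\,\sigma(\pi,y) - (\alpha+\beta) f'\pi
  &= C(\pi,2) + \rho \sum_{y} V(T(\pi,y))\,\sigma(\pi,y).
\end{align*}
```

Both arguments of the minimum in (23) are shifted by the same amount $(\alpha + \beta) f'\pi$, so the minimizing action, and hence the stopping set, is unchanged.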
Value Iteration Algorithm and methodology: We comment briefly here on our analysis methodology
which is detailed in Sec. III. Let $k = 1, 2, \ldots$ denote iteration number (the fact that we
used $k$ previously to denote time should not result in confusion). The value iteration algorithm
is a fixed point iteration of Bellman's equation and proceeds as follows:

$$V_0(\pi) = -(\alpha + \beta) f'\pi, \quad V_{k+1}(\pi) = \min_{u \in \{1, 2\}} Q_{k+1}(\pi, u), \quad \mu^*_{k+1}(\pi) = \arg\min_{u \in \{1, 2\}} Q_{k+1}(\pi, u),$$

$$\text{where } Q_{k+1}(\pi, 2) = C(\pi, 2) + \rho \sum_{y \in \mathbb{Y}} V_k(T(\pi, y)) \sigma(\pi, y), \quad Q_{k+1}(\pi, 1) = C(\pi, 1). \qquad (27)$$

Let $\mathcal{B}(X)$ denote the set of bounded real-valued functions on $\Pi(X)$. Then for any $V$ and
$\tilde{V} \in \mathcal{B}(X)$, define the sup-norm metric $\sup \|V(\pi) - \tilde{V}(\pi)\|$, $\pi \in \Pi(X)$. Then $\mathcal{B}(X)$ is a
Banach space. The value iteration algorithm (27) will generate a sequence of value functions
$\{V_k\} \subset \mathcal{B}(X)$ that will converge uniformly (sup-norm metric) as $k \to \infty$ to $V(\pi) \in \mathcal{B}(X)$,
the optimal value function of Bellman's equation. However, since the belief state space $\Pi(X)$
is an uncountable set, the value iteration algorithm (27) does not translate into a practical solution
methodology, as $V_k(\pi)$ needs to be evaluated at each $\pi \in \Pi(X)$, an uncountable set. Indeed, due
to the nonlinearity in the belief states, the formulation is more complex than a partially observed
Markov decision process which is known to be PSPACE hard [34]. Although value iteration is
not useful from a computational point of view, in Sec. III, we exploit the structure of the value
iteration recursion (25), (27) to prove that $\mathcal{R}_1$ is characterized by a threshold switching curve.
We then exploit this structure to devise polynomial complexity algorithms for approximating the
optimal policy $\mu^*$ and thus determining the stopping set $\mathcal{R}_1$.

Remark: Computational algorithms based on value iteration such as Sondik's algorithm, Monahan's
algorithm, Cheng's algorithm, Witness algorithm (see [9], [10] for a tutorial description)
and Lovejoy's suboptimal algorithms [29] solve POMDPs with linear costs (i.e., $\alpha = 0$) over finite
horizons. These algorithms require finite observation spaces and are computationally intractable
except for small $X$ and $\mathbb{Y}$. They are not applicable directly to the stopping problems considered in
this paper since we consider nonlinear penalty costs, possibly continuous observation space $\mathbb{Y}$
(Examples 1, 2 and 3), and problems where the observation probabilities depend on the belief
state (social learning in Examples 4 and 5).
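
For intuition only, the recursion (27) can be run numerically in the simplest case $X = 2$, where the belief is parametrized by the scalar $p = \pi(1)$. The sketch below is not one of the algorithms discussed in the text: it swaps in a grid over $p$ with linear interpolation of $V_k$ between grid points, assumes a finite observation alphabet, and takes $f = g = 1_X - e_1$ with delay cost (17). The function name and all numerical values are hypothetical.

```python
import numpy as np

def value_iteration_grid(P21, B, alpha, beta, d, rho, n_grid=201, n_iter=200):
    """Approximate (27) for X = 2 on a grid over p = pi(1); V_k is interpolated."""
    P = np.array([[1.0, 0.0], [P21, 1.0 - P21]])
    grid = np.linspace(0.0, 1.0, n_grid)
    beliefs = np.stack([grid, 1.0 - grid], axis=1)   # rows are pi = [p, 1 - p]
    f = np.array([0.0, 1.0])                         # f = 1_X - e_1
    pred = beliefs @ P                               # rows are (P' pi)'
    # Transformed costs (24) with G'pi - (g'pi)^2 = p - p^2 for g = 1_X - e_1.
    C1 = alpha * (grid - grid ** 2) - alpha * (beliefs @ f)
    C2 = d * pred[:, 0] - (alpha + beta) * (beliefs @ f) + rho * (alpha + beta) * (pred @ f)
    V = -(alpha + beta) * (beliefs @ f)              # V_0
    Q2 = C2.copy()
    for _ in range(n_iter):
        expected = np.zeros(n_grid)
        for y in range(B.shape[1]):
            unnorm = pred * B[:, y]                  # rows are (B_y P' pi)'
            sigma = unnorm.sum(axis=1)
            p_next = np.divide(unnorm[:, 0], sigma, out=np.zeros(n_grid), where=sigma > 0)
            expected += sigma * np.interp(p_next, grid, V)
        Q2 = C2 + rho * expected
        V = np.minimum(C1, Q2)
    policy = np.where(C1 < Q2, 1, 2)                 # 1 = stop, 2 = continue
    return grid, V, policy

# Hypothetical parameters; row 0 of B is the post-change observation distribution.
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])
grid, V, policy = value_iteration_grid(P21=0.05, B=B, alpha=0.5, beta=1.0, d=1.5, rho=0.9)
```

With these numbers the policy continues at $p = 0$ and stops at $p = 1$. Under the assumptions of Theorem 1 one would expect a single switch in between, although a gridded approximation does not by itself prove this.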

III. Example 1: Quickest Time Detection with PH-distributed Change Time and Variance Penalty

This section considers quickest time detection with PH-distributed change time and variance 
penalty. Sec.III-A below gives the main results of this paper, namely the optimal decision policy 
is characterized by a threshold curve. Sec.III-B and III-C discuss the implications and main 
assumptions. Sec.III-D then parametrizes the optimal linear approximation to this threshold curve. 
Finally, Sec.III-E gives a stochastic optimization algorithm (Algorithm 1) to compute this optimal 
linear approximation. 

The quickest time detection problem is a special case of the stochastic control problem
formulated in Sec. II. The states $2, 3, \ldots, X$ are fictitious and are defined to generate the change
time $\tau^0$ with PH-distribution (7). So states $2, 3, \ldots, X$ are indistinguishable in terms of the
observation $y$. That is, the observation probabilities $B$ in (8) and Markov chain state levels $g$ in
(14) satisfy

$$B_{2y} = B_{3y} = \cdots = B_{Xy} \text{ for all } y \in \mathbb{Y}, \quad g_1 = 0, \; g_2 = g_3 = \cdots = g_X = 1. \qquad (28)$$

The above choice of $g = 1_X - e_1$ is without loss of generality since the variance penalty (14)
is translation invariant with respect to $c 1_X$ for any $c$.
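
With the state levels in (28), the variance penalty (14) reduces to a function of $e_1'\pi$ alone. A short check, using only (14), (28) and $1_X'\pi = 1$:

```latex
\begin{align*}
g = 1_X - e_1 \;\Rightarrow\; G_i = g_i^2 = g_i, \qquad g'\pi &= 1 - e_1'\pi, \\
G'\pi - (g'\pi)^2 &= (1 - e_1'\pi) - (1 - e_1'\pi)^2 = e_1'\pi - (e_1'\pi)^2 .
\end{align*}
```

This is the variance of the indicator of the event $\{x_k = e_1\}$: it vanishes when $e_1'\pi \in \{0, 1\}$ and is largest at $e_1'\pi = 1/2$.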

Notation: Notation and definitions regarding stochastic orders, lattice programming, the poset
$[\Pi(X), \ge_r]$ and submodularity are given in Appendix A. Below $\ge_r$ denotes the monotone
likelihood ratio order, $\ge_{L_X}$ denotes the likelihood ratio order on lines $\mathcal{L}(e_X, \bar{\pi})$, $\ge_{L_1}$ denotes
the likelihood ratio order on lines $\mathcal{L}(e_1, \bar{\pi})$, and $\ge_s$ denotes first order stochastic dominance.

A. Main Result: Existence of Decision Curve Policy for Quickest Time Detection

This section gives three main results: The optimal policy for quickest detection with PH-distributed
change time and variance penalty is characterized by a threshold curve (Theorems 1
and 2). Also for $\alpha = 0$, it is shown that the stopping set $\mathcal{R}_1$ is convex (Theorem 3).

1) Quickest Detection with Delay Penalty (17): For the stopping cost $C(\pi, 1)$ in (16), choose
$f = [0, 1, \ldots, 1]' = \mathbf{1}_X - e_1$. This weighs the states $2, \ldots, X$ equally in the false alarm penalty.
With assumption (28), the variance penalty (14) becomes $\alpha(e_1'\pi - (e_1'\pi)^2)$. The delay cost $C(\pi, 2)$
is chosen as (17). To summarize, (24), (25), (26) hold with

$$C(\pi, 1) = \alpha\left(e_1'\pi - (e_1'\pi)^2\right) + \beta(1 - e_1'\pi), \qquad C(\pi, 2) = d\, e_1' P' \pi. \tag{29}$$

Theorem 1 below is our main result on the structure of the optimal decision policy $\mu^*(\pi)$. It
is based on the following assumptions (discussed in Sec.III-C).

(A1-Ex1) $d \ge \rho(\alpha + \beta)$.

(A2) The observation distribution $B_{xy}$ in (8) is TP2 in $(x, y)$ (see Defn.4(iii) in Appendix A).
Equivalently, from (28), $B_{2y} \ge_r B_{1y}$.

(A3) The transition matrix $P$ in (5) is TP2, i.e., all its second order minors are non-negative.

(S-Ex1) $(d - \rho(\alpha + \beta))(1 - P_{21}) \ge \alpha - \beta$.

(A1-Ex1) and (S-Ex1) are constraints on the delay and stopping cost functions (that the
decision maker can design), while (A2) and (A3) are assumptions on the underlying observation
distribution (8) and PH-distribution (6).

Theorem 1 (Switching Curve Optimal Policy): Consider the quickest time detection problem
(20) with costs defined in (29) and PH-distributed change time $\tau^0$ defined in (6). Then for
$\pi \in \Pi(X)$, under (A1-Ex1), (A2), (A3), (S-Ex1), there exists an optimal policy $\mu^*(\pi)$ that is
$\ge_{L_X}$ increasing on lines $\mathcal{L}(e_X, \bar\pi)$ and $\ge_{L_1}$ increasing on lines $\mathcal{L}(e_1, \bar\pi)$. As a consequence:

(i) The stopping set $\mathcal{R}_1$ defined in (26) has the following structure: There exists a threshold
switching curve $\Gamma$ that partitions belief state space $\Pi(X)$ into two individually connected regions
$\mathcal{R}_1$, $\mathcal{R}_2$, such that the optimal policy is

$$\mu^*(\pi) = \begin{cases} \text{continue} = 2 & \text{if } \pi \in \mathcal{R}_2 \\ \text{stop} = 1 & \text{if } \pi \in \mathcal{R}_1 \end{cases} \tag{30}$$

(A set is connected if it cannot be expressed as the union of two disjoint nonempty closed sets
[42]). The threshold curve $\Gamma$ intersects each line $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$ at most once.

(ii) There exists an $i^* \in \{0, \ldots, X\}$, such that $e_1, e_2, \ldots, e_{i^*} \in \mathcal{R}_1$ and $e_{i^*+1}, \ldots, e_X \in \mathcal{R}_2$.

(iii) For geometric distributed change time $\tau^0$, there exists a unique threshold point $\pi^*(2)$ such
that (1) holds. (Note (A3) holds trivially in this case). ■

Theorem 1 is proved in Appendix C and uses meta Theorem 13 in Appendix B as a key step.
The intuition behind Theorem 1 is discussed in Sec.III-B and III-C below. Fig. 1 gives a pictorial
illustration. Note that if $\alpha = 0$, then (S-Ex1) holds trivially if (A1-Ex1) holds.
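As a small numerical illustration of the costs in (29), the following sketch evaluates the stopping and delay costs at a belief state. The function names and the example matrix are hypothetical and chosen only for illustration; beliefs are stored as NumPy arrays with state 1 at index 0.

```python
import numpy as np

def stopping_cost(pi, alpha, beta):
    """C(pi, 1) as in (29): variance penalty plus false alarm cost."""
    p1 = pi[0]  # e_1' pi, posterior probability of state 1
    return alpha * (p1 - p1 ** 2) + beta * (1.0 - p1)

def delay_cost(pi, d, P):
    """C(pi, 2) as in (29): d * e_1' P' pi."""
    return d * (P.T @ pi)[0]

# Illustrative transition matrix (assumption): state 1 is absorbing.
P_example = np.array([[1.0, 0.0, 0.0],
                      [0.1, 0.9, 0.0],
                      [0.0, 0.2, 0.8]])
```

At the vertex $e_1$ the stopping cost vanishes and the delay cost equals $d$, while at $e_i$, $i \ge 2$, the stopping cost equals $\beta$ and the delay cost equals $d P_{i1}$.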

2) Quickest Detection with Delay Penalty (18): Next consider the 'classical' delay cost $C(\pi, 2)$
in (18) and stopping cost $C(\pi, 1)$ in (16) with $g$ in (28). Then (24), (25), (26) hold with

$$C(\pi, 1) = \alpha\left(e_1'\pi - (e_1'\pi)^2\right) + \beta f'\pi, \qquad C(\pi, 2) = d\, e_1'\pi. \tag{31}$$

Below we show that Theorem 1 continues to hold, if the decision maker designs the false alarm 
vector f to satisfy the following linear constraints: 
(AS-Exl) (i) ft > max{l,p2±2f'P'e < + *f}, % > 2. 

(ii) fj -fi> pi'P'ie, - j > i, i £ {2, . . . , X - 2} 

(iii) fx~fi> ^±%'P'(e x - e,), % £ {2, . . . , X - 1}. 

Feasible choices of f are easily obtained by a linear programming solver. 

Theorem 2: Consider the quickest detection problem with delay and stopping costs in (31). 
Then under (AS-Exl), (A2), (A3), Theorem 1 holds. ■ 

3) Convexity of Stopping Region when $\alpha = 0$: Finally, we present the following result for the
case $\alpha = 0$, i.e., no variance penalty.

Theorem 3: For arbitrary PH-distributed change time $\tau^0$, and no variance penalty ($\alpha = 0$),
the stopping region $\mathcal{R}_1$ is a convex subset of $\Pi(X)$. ■

The proof of Theorem 3 is in Appendix D; it was proved in [28] in a POMDP setting.
Theorem 3 says that as long as costs $C(\pi, 1)$, $C(\pi, 2)$ are linear in $\pi$ (i.e., no variance penalty),
then the stopping set is convex for any size $X$ (i.e., arbitrary PH-distribution); no assumptions are
required on the transition matrix $P$ or observation likelihood matrix $B$. However, even though
$\mathcal{R}_1$ is convex (and therefore connected), Theorem 3 does not guarantee that $\mathcal{R}_2$ is connected.
As described in Sec.III-C, Theorem 1 and Theorem 2 go much further than Theorem 3 in
characterizing $\mathcal{R}_1$ and $\mathcal{R}_2$, even for the case $\alpha = 0$.

B. Discussion of Theorems 1, 2

Theorems 1 and 2 imply that since the optimal decision policy is characterized by a threshold
curve, quickest time detection for PH-distributed change times and variance penalty can be
implemented efficiently, see Sec.III-D. Without this result, the stopping set $\mathcal{R}_1$ is not necessarily
a connected region, as will be shown in Sec.VIII.

1) Geometric distributed change time: When the change time $\tau^0$ is geometrically distributed,
since the state space $X = \{1, 2\}$, $\Pi(X)$ is a one dimensional simplex. Then the stochastic orders
$\ge_r$, $\ge_{L_1}$ and $\ge_s$ defined in Appendix A coincide, and become total orders. Also (A3) holds
automatically for this case. Below we discuss the cases of $\alpha \neq 0$ and $\alpha = 0$.

(i) $\alpha \neq 0$: For geometric distributed change time, Theorems 1 and 2 say that the classical
threshold policy depicted in (1) continues to hold when a nonlinear variance penalty is considered.
For example, consider Theorem 2 with non-zero $\alpha$, delay in (31) and false alarm vector $f = e_2$.
So the false alarm cost is $f'\pi = 1 - e_1'\pi$ (which is identical to (29)). One can view this as
the Kolmogorov-Shiryayev criterion (22) with an additional variance penalty. Theorem 2 holds
under the conditions (AS-Exl) and (A2). Here (AS-Exl) equivalent to the constraint that a < 
d+ i+p£ 22) - ( Choose h = °' h = 1 in (AS-Exl)(i)). So for a < d/2, (AS-Exl) always holds. 

(ii) $\alpha = 0$: For quickest time detection with geometric distributed change time and no variance
penalty ($\alpha = 0$), the well known existence of a threshold point (e.g., for the Kolmogorov-Shiryayev
criterion (22)) follows trivially from Theorem 3. Since $\Pi(X)$ is a one dimensional
simplex, convexity of stopping set $\mathcal{R}_1$ (Theorem 3) implies that there is a threshold point $\pi^*$
that satisfies (1).

2) Avoiding trivial cases: To ensure the stopping set $\mathcal{R}_1$ contains state $e_1$, assume $C(e_1, 1) <
C(e_1, 2)$. From (29) or (31) this is equivalent to $d > 0$. The strict inequality also implies that
$\mathcal{R}_1 \cap \Pi^\circ(X)$ is non-empty, where $\Pi^\circ(X)$ denotes the interior of the simplex $\Pi(X)$.

For the detection problem to be non-trivial, we want $C(e_i, 1) > C(e_i, 2)$ for $i \ge 2$, otherwise
it is always optimal to stop at time 1. For the case of Theorem 1, from (29), a sufficient
condition is that $\beta > d P_{i1}$, $i = 2, \ldots, X$. Since the transition matrix $P$ is TP2, it follows$^4$ that
$P_{21} \ge P_{31} \ge \cdots \ge P_{X1}$. Therefore, it is sufficient for the decision maker to choose constants
$\beta$ and $d$ such that $\beta > d P_{21}$. For Theorem 2, from (31), $C(e_i, 1) > C(e_i, 2)$ always holds for
$\beta > 0$ and $f_i > 0$, $i \ge 2$.

$^4$ Proof: We prove the contrapositive, that is, $P_{i1} < P_{i+1,1}$ implies $P$ is not TP2. Recall from (A3), TP2 means that
$P_{i1} P_{i+1,j} \ge P_{i+1,1} P_{ij}$ for all $j$. So assuming $P_{i1} < P_{i+1,1}$, to show that $P$ is not TP2, we need to show that there is at least
one $j$ such that $P_{i+1,j} < P_{ij}$. But, since each row of $P$ sums to one, $P_{i1} < P_{i+1,1}$ implies $\sum_{k \neq 1} P_{i+1,k} < \sum_{k \neq 1} P_{ik}$, which in turn implies that for at least one
$j$, $P_{i+1,j} < P_{ij}$.

Non-degenerate Threshold Curve: Let $\Pi^\circ(X)$ and $\Pi^b(X)$, respectively, denote the interior
and boundary of the simplex $\Pi(X)$. Determining the threshold switching curve $\Gamma$ in Theorem
1 requires determining $\Gamma \cap \Pi^\circ(X)$ (portion of curve that lies in the interior of the simplex) and
$\Gamma \cap \Pi^b(X)$ (portion of curve that lies on the boundary of the simplex). Since $\Pi^b(X)$ comprises
sub-simplices, to determine $\Gamma \cap \Pi^b(X)$ one would need to search for the threshold curve $\Gamma$
within these sub-simplices. While conceptually straightforward, we can eliminate this search by
ensuring that the belief state $\pi$ always lives in the interior $\Pi^\circ(X)$ of the simplex. The following
lemma gives sufficient conditions for the sequence of belief states $\pi_k$ over time to lie in $\Pi^\circ(X)$.

Lemma 1: Suppose each column of transition matrix $P$ has at least one non-zero element, and
the observation likelihoods satisfy $B_{xy} \neq 0$ for $y \in Y$. Then for initial belief state $\pi_0$ satisfying
(4) with $\pi_0(i) \neq 0$, $i > 1$, subsequent belief states $\pi_k$ lie in $\Pi^\circ(X) \cup \{e_1\}$ for all time $k \ge 1$.

The proof follows straightforwardly from the belief state update (11). Since the sequence of
belief states $\pi_k$, $k \ge 1$, lives in $\Pi^\circ(X)$, one only needs to compute the threshold curve inside
the simplex, i.e., $\Gamma \cap \Pi^\circ(X)$. Recall from the previous remark that $\mathcal{R}_1 \cap \Pi^\circ(X)$ is non-empty
and consequently $\Gamma \cap \Pi^\circ(X)$ is non-empty.
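The claim of Lemma 1 can be checked numerically with a minimal sketch of a Bayesian filter step. The form $T(\pi, y) \propto B_y P'\pi$ used below is the standard hidden Markov model filter and is assumed here to match the update (11); the function name and the example matrices are hypothetical.

```python
import numpy as np

def belief_update(pi, y, P, B):
    """One step of the standard HMM filter (assumed form of (11)).

    pi : current belief over states, shape (X,)
    y  : index of the observed symbol
    P  : transition matrix, shape (X, X), rows sum to one
    B  : observation likelihoods, B[x, y] = Prob(y | state x)
    """
    unnormalized = B[:, y] * (P.T @ pi)
    return unnormalized / unnormalized.sum()

# Illustrative model (assumption): state 1 absorbing, states 2 and 3 share
# the same observation probabilities as required by (28).
P_example = np.array([[1.0, 0.0, 0.0],
                      [0.1, 0.9, 0.0],
                      [0.0, 0.2, 0.8]])
B_example = np.array([[0.7, 0.3],
                      [0.2, 0.8],
                      [0.2, 0.8]])
```

With these matrices, a prior that puts zero mass on state 1 but positive mass on states 2 and 3 is mapped to a belief with all entries strictly positive, and the vertex $e_1$ is mapped to itself.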

C. Assumptions and Proofs of Theorem 1 and Theorem 2 

Below we discuss the main assumptions of Theorem 1 and Theorem 2 in Sec.III-A, then
outline the structure of the proof, and finally give intuitive examples that illustrate the structure
of stopping set $\mathcal{R}_1$.

1) Discussion of Assumptions: Recall (A1-Ex1) and (S-Ex1) are design constraints the decision
maker uses to choose the stopping and delay costs. In contrast, (A2) and (A3) are
assumptions on the underlying stochastic model.

As described in Appendix B, (A1-Ex1) is sufficient for $C(\pi, 2)$ to be $\ge_r$ decreasing. We
also require $C(\pi, 1)$ in (29) to be $\ge_r$ decreasing, but this holds trivially in our setup.

(S-Ex1) is a submodularity condition, see Defn.5 in Appendix. We refer to [1], [48] for extensive
treatments of lattice programming and submodularity. The key idea is that if $Q(\pi, u)$ is submodular
on the partially ordered set $[\Pi(X), \ge_r]$, then the optimal policy $\mu^*(\pi) = \arg\min_u Q(\pi, u)$
is monotone increasing with respect to $\ge_r$.

In our setting, submodularity of $Q(\pi, u)$ in (25) is equivalent to showing that $Q(\pi, 2) - Q(\pi, 1)$
is decreasing with respect to $\pi$ (in terms of the MLR order $\ge_r$, see discussion of structure of
proof given below). Since by (A1-Ex1), $C(\pi, 1)$ and $C(\pi, 2)$ are MLR decreasing in $\pi$, it will
be proved that $V(\pi)$ is MLR decreasing in $\pi$ provided (A2) and (A3) hold. Since the sum of
submodular functions is submodular, establishing submodularity of $Q(\pi, u)$ in (25) is equivalent
to establishing submodularity of C(ir,u). Clearly if a = 0, then from (29), C(n,l) = is 
independent of n. So if a = then the submodular condition (S-Exl) holds trivially since 
$C(\pi, 2)$ is decreasing in $\pi$ via (A1-Ex1). So for quickest time detection with PH-distributed
change time and no variance penalty, submodularity holds by construction. Note that when the
variance penalty is included, (S-Ex1) always holds if $\alpha < \beta$.

(AS-Ex1) in Theorem 2 ensures that $C(\pi, 1)$ and $C(\pi, 2)$ in (24) for the modified delay cost
in (31) are monotone decreasing and that $C(\pi, u)$ is submodular. It is analogous to (A1-Ex1) and
(S-Ex1).

(A2) is required for preserving the MLR ordering with respect to observation $y$ of the Bayesian
filter update $T(\pi, y)$; this is a key step in showing $V(\pi)$ is MLR decreasing in $\pi$. Theorem 13(ii)
in the Appendix states that $T(\pi, y)$ is MLR increasing in $y$ iff (A2) holds.

(A2) is satisfied by numerous continuous and discrete distributions, see a classical detection 
theory book such as [37]. Examples include Gaussians, Exponential, Binomial, Poisson, etc. 

(A3) is essential for the Bayesian update $T(\pi, y)$ preserving monotonicity with respect to
$\pi$. Theorem 13(i) in the appendix shows that $T(\pi, y)$ is MLR increasing in $\pi$ iff $P'\pi$ is MLR
increasing in $\pi$, and (A3) is a sufficient condition for the latter. TP2 stochastic orders and kernels
have been studied in great detail in [18].

(A3) is satisfied by several classes of transition matrices; see [20], [19]. Consider, for example,
a tridiagonal transition probability matrix $P$ with $P_{ij} = 0$ for $j \ge i + 2$ and $j \le i - 2$. As shown
in [15, pp. 99-100], a necessary and sufficient condition for tridiagonal $P$ to be TP2 is that

$$P_{ii} P_{i+1,i+1} \ge P_{i,i+1} P_{i+1,i}.$$
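The TP2 property in (A2) and (A3) can be verified numerically for a given matrix. The sketch below checks all second order minors directly and also evaluates the tridiagonal condition stated above; the function names and the tolerance argument are hypothetical choices for illustration.

```python
import numpy as np
from itertools import combinations

def is_tp2(M, tol=1e-12):
    """Return True if every second order minor of M is non-negative."""
    M = np.asarray(M, dtype=float)
    rows, cols = M.shape
    for i1, i2 in combinations(range(rows), 2):      # i1 < i2
        for j1, j2 in combinations(range(cols), 2):  # j1 < j2
            if M[i1, j1] * M[i2, j2] - M[i1, j2] * M[i2, j1] < -tol:
                return False
    return True

def tridiagonal_condition(P, tol=1e-12):
    """Check P[i,i] * P[i+1,i+1] >= P[i,i+1] * P[i+1,i] for all i."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return all(P[i, i] * P[i + 1, i + 1] >= P[i, i + 1] * P[i + 1, i] - tol
               for i in range(n - 1))
```

The brute-force check costs $O(X^4)$ operations, which is negligible for the small state spaces considered here; for a tridiagonal matrix the second function gives the same answer with $O(X)$ work.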

2) Structure of Proof of Theorem 1: The proof in the appendix comprises three steps. Steps
1 and 2 below are proved in meta Theorem 13 in Appendix B under general conditions (A1),
(A2), (A3) and (S).

Step 1: We first show that the value function $V(\pi)$ is MLR decreasing (see Appendix A for
definition). As shown in the proof, the general conditions (A1), (A2), (A3) are sufficient for
$V(\pi)$ to be $\ge_r$ decreasing on $\Pi(X)$. This involves showing that $C(\pi, 1)$, $C(\pi, 2)$ are MLR
decreasing. (A1)(i) and (A1)(ii) in the appendix are sufficient conditions for this. The proof that
(A1)(i) is sufficient for $C(\pi, 1)$ to be MLR decreasing is similar in spirit to the Schur-convexity
proof (Theorem A.3 of [30]) with the difference that in Schur convexity the vectors $\pi$ have
elements in ascending order while in our case the elements of $\pi$ can be in any order.

Conditions (A2) and (A3) are required for MLR monotone updates $T(\pi, y)$ of the belief state,
and also first order stochastic dominance monotonicity of $\sigma(\pi, y)$, see Appendix for definition.

Step 2: We then prove that $Q(\pi, u)$ is submodular on $[\mathcal{L}(e_X, \bar\pi), \ge_{L_X}]$ and $[\mathcal{L}(e_1, \bar\pi), \ge_{L_1}]$. (S) is
sufficient for $C(\pi, u)$ to be submodular on lines $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$. Since we only require
submodularity on lines $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$, and these are chains (i.e., totally ordered subsets
of a partially ordered set), the condition (S) is less restrictive than requiring submodularity on the
entire simplex $\Pi(X)$. Finally (A1), (A2), (A3), (S) are sufficient for $Q(\pi, u)$ to be submodular on
lines $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$. So Theorem 13 in Appendix B implies a monotone policy on each
chain $[\mathcal{L}(e_X, \bar\pi), \ge_{L_X}]$. So there exists a threshold belief state on each line where the optimal
policy switches from 1 to 2. (A similar argument holds for lines $[\mathcal{L}(e_1, \bar\pi), \ge_{L_1}]$).

Step 3: Step 3 is proved in Appendix C. The entire simplex $\Pi(X)$ can be covered by the union
of lines $\mathcal{L}(e_X, \bar\pi)$. The union of the resulting threshold belief states yields the threshold curve
$\Gamma$. This is illustrated in Fig. 1.

3) Some Intuition: Recall for $X = \{1, 2, 3\}$, the belief state space $\Pi(3)$ is an equilateral
triangle. So on $\Pi(3)$, more insight can be given to visualize what the above theorem says.$^5$
In Fig. 2, six examples are given of decision regions that violate the theorem. To make these
examples non-trivial, we have included $e_1 \in \mathcal{R}_1$ in all cases.

The decision regions in Fig. 2(a) violate the condition that $\mu^*(\pi)$ is increasing on lines towards
$e_3$. Even though $\mathcal{R}_1$ and $\mathcal{R}_2$ are individually connected regions, the depicted line $\mathcal{L}(e_3, \bar\pi)$
intersects the boundary of $\mathcal{R}_2$ more than once (and so violates Theorem 1).

The decision regions in Fig. 2(b) satisfy Theorem 3 since the stopping set $\mathcal{R}_1$ is convex. As
mentioned in Sec.III-A, Theorem 1 gives more structure to $\mathcal{R}_1$ and $\mathcal{R}_2$. Indeed, the decision
regions in Fig. 2(b) violate Theorem 1. They violate the statement that the policy is increasing on
lines towards $e_3$ since the boundary of $\mathcal{R}_1$ (i.e., threshold curve $\Gamma$) cannot intersect a line from
$e_3$ more than once. Therefore, Theorem 1 says a lot more about the structure of the boundary
than convexity does. In particular, for the PH-distributed change time without variance penalty,
Theorems 1 and 3 together say that the threshold curve $\Gamma$ is convex and cannot intersect a line
$\mathcal{L}(e_3, \bar\pi)$ or a line $\mathcal{L}(e_1, \bar\pi)$ more than once.

$^5$ The threshold curve $\Gamma$ intersects each line segment from vertex $e_3$, and each line segment from vertex $e_1$ at most once.
This implies $\Gamma$ can be parametrized by a pair of monotonically decreasing angles with respect to vertices $e_1$ and $e_3$. By
Lebesgue theorem [42], a monotone function is differentiable almost everywhere. So for $X = \{1, 2, 3\}$, $\Gamma$ is differentiable
almost everywhere.

Fig. 1. Illustration of threshold switching decision curve $\Gamma$. Here $X = 3$ and hence $\Pi(X)$ is an equilateral triangle. Theorem 1
shows that the stopping region $\mathcal{R}_1$ is connected and $e_1 \in \mathcal{R}_1$. Also $\mathcal{R}_2$ is connected. The line segments $\mathcal{L}(e_3, \bar\pi)$ connecting
the sub-simplex $\Pi(2)$ to $e_3$ are defined in (80). Theorem 1 says that the threshold curve $\Gamma$ can intersect each line $\mathcal{L}(e_X, \bar\pi)$ only
once (and similarly intersect each line $\mathcal{L}(e_1, \bar\pi)$ only once). In the special case $\alpha = 0$, Theorem 3 says that $\mathcal{R}_1$ is a convex set.

Fig. 2. Examples of decision regions that violate Theorem 1 on belief space $\Pi(3)$. Panels: (a) Example 1, (b) Example 2,
(c) Example 3, (d) Example 4, (e) Example 5, (f) Example 6.

Fig. 2(c) also satisfies Theorem 3 since $\mathcal{R}_1$ is convex. But the decision regions in Fig. 2(c)
violate Statement (ii) of Theorem 1. In particular, if $e_1$ and $e_3$ lie in $\mathcal{R}_1$, then $e_2$ should also
lie in $\mathcal{R}_1$. Again this reveals that Theorem 1 says a lot more about the structure of the stopping
region even for the case of zero variance penalty ($\alpha = 0$).

Fig. 2(d) also satisfies Theorem 3 since $\mathcal{R}_1$ is convex, but does not satisfy Theorem 1 since
$\mathcal{R}_2$ is not a connected set. Indeed, when $\mathcal{R}_2$ is not connected, as shown in the proof of Theorem
1, the policy is not monotone on the line $\mathcal{L}(e_3, \bar\pi)$ since it goes from 2 to 1 to 2.

Fig. 2(e) and (f) violate Theorem 1 since the optimal policy $\mu^*(\pi)$ is not monotone on line
$\mathcal{L}(e_1, \bar\pi)$; it goes from 1 to 2 to 1. For the case $\alpha = 0$, Fig. 2(e) and (f) violate Theorem 3 since
the stopping region $\mathcal{R}_1$ is non-convex.

Since the conditions of Theorem 1 are sufficient conditions, what happens when they do not
hold? In Sec.VIII, we will give a numerical example where (S-Ex1) is violated and $\mathcal{R}_1$ is no
longer a connected set (Fig. 6(d)). It is straightforward to construct other examples where both
$\mathcal{R}_1$ and $\mathcal{R}_2$ are disconnected regions when the assumptions of Theorem 1 are violated.

D. Characterization of Optimal Linear Decision Threshold 

This subsection assumes that (A1-Ex1), (A2), (A3), (S-Ex1) of Sec.III-A hold. So Theorem 1
applies and computing the optimal policy $\mu^*$ reduces to estimating the threshold curve $\Gamma$. In
general, any user-defined basis function approximation can be used to parametrize this curve.
However, any such approximation needs to capture the essential feature of Theorem 1: the
parametrized optimal policy needs to be MLR increasing on lines. (An identical discussion
applies to Theorem 2 with assumptions (AS-Ex1), (A2), (A3)).

Below, we derive the optimal linear approximation to the threshold curve $\Gamma$ on simplex $\Pi(X)$.
Such a linear decision threshold has two attractive properties: (i) Estimating it is computationally
efficient. (ii) We give conditions on the coefficients of the linear threshold that are necessary
and sufficient for the resulting policy to be MLR increasing on lines. Due to the necessity and
sufficiency of the condition, optimizing over the space of linear thresholds on $\Pi(X)$ yields the
optimal linear approximation to threshold curve $\Gamma$.
On $\Pi(X)$, define the linear threshold policy $\mu_\theta(\pi)$ as

$$\mu_\theta(\pi) = \begin{cases} \text{stop} = 1 & \text{if } \begin{bmatrix} 0 & 1 & \theta' \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix} < 0 \\ \text{continue} = 2 & \text{otherwise} \end{cases} \qquad \pi \in \Pi(X). \tag{32}$$

Here $\theta = (\theta(1), \ldots, \theta(X-1))' \in \mathbb{R}^{X-1}$ denotes the parameter vector of the linear threshold
policy. (Since $\Pi(X) \subset \mathbb{R}^{X-1}$, a linear hyperplane on $\Pi(X)$ is parametrized by $X - 1$ coefficients).

Theorem 4 below characterizes the optimal linear decision threshold approximation to the
threshold curve on $\Pi(X)$. Assume conditions (A1-Ex1), (A2), (A3), (S-Ex1) hold for the quickest
detection problem (20) so that from Theorem 1, the optimal policy $\mu^*(\pi)$ is MLR increasing on
lines $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$. Assume the conditions of Lemma 1 hold, so that one only needs
to search for the optimal linear threshold in the interior of $\Pi(X)$. Finally, the requirement that
$e_1$ lies in the stopping set means $\mu_\theta(e_1) = 1$, which implies $\theta(X-1) > 0$.

Theorem 4 (Optimal Linear Threshold Policy): For belief states $\pi \in \Pi(X)$, the linear threshold
policy $\mu_\theta(\pi)$ defined in (32) is

(i) MLR increasing on lines $\mathcal{L}(e_X, \bar\pi)$ iff $\theta(X-2) \ge 1$ and $\theta(i) \le \theta(X-2)$ for $i < X - 2$.

(ii) MLR increasing on lines $\mathcal{L}(e_1, \bar\pi)$ iff $\theta(i) \ge 0$, for $i < X - 2$. ■

The proof of Theorem 4 is in Appendix E. As a consequence of Theorem 4, the optimal linear
threshold approximation to threshold curve $\Gamma$ of Theorem 1 is the solution of the following
constrained optimization problem:

$$\theta^* = \arg\min_\theta J_{\mu_\theta}(\pi_0), \ \text{ subject to } 0 \le \theta(i) \le \theta(X-2), \ \theta(X-2) \ge 1 \text{ and } \theta(X-1) > 0 \tag{33}$$

where the cost $J_{\mu_\theta}(\pi_0)$ is obtained as in (20) by applying threshold policy $\mu_\theta$ in (32).
Remark: The constraints in (33) are necessary and sufficient for the linear threshold policy (32)
to be MLR increasing on lines $\mathcal{L}(e_X, \bar\pi)$ and $\mathcal{L}(e_1, \bar\pi)$. That is, (33) defines the set of all MLR
increasing linear threshold policies: it does not leave out any MLR increasing policies, nor
does it include any non-MLR increasing policies. Therefore optimizing over the space of MLR
increasing linear threshold policies yields the optimal linear approximation to threshold curve $\Gamma$.

Intuition: Consider $X = 3$ with state space $\{1, 2, 3\}$, so that the belief space $\Pi(X)$ is an equilateral
triangle. Then with $(\omega(1), \omega(2))$ denoting Cartesian coordinates in the equilateral triangle, clearly
$\pi(2) = 2\omega(2)/\sqrt{3}$, $\pi(1) = \omega(1) - \omega(2)/\sqrt{3}$ and the linear threshold satisfies

$$\omega(2) = \frac{\sqrt{3}\,\bigl(\theta(1)\,\omega(1) + \theta(2) - \theta(1)\bigr)}{2 - \theta(1)}. \tag{34}$$

So the conclusion of Theorem 4 that $\theta(1) \ge 1$ implies that the linear MLR increasing threshold
has slope of 60° or larger. For $\theta(1) > 2$, it follows from (34) that the slope of the linear threshold
becomes negative, i.e., more than 90°. For a non-degenerate threshold, the $\omega(1)$ intercept of the
line should lie in $[0, 1]$, implying $\theta(1) \ge \theta(2)$ and $\theta(2) \ge 0$. Fig. 3 illustrates these results. Fig. 3(a)
and (b) illustrate valid linear thresholds. In Fig. 3(a) and (b), the conditions of Theorem 4 hold
(the slope is larger than 60° and the $\omega(1)$ intercept is in $[0, 1]$). Fig. 3(c) shows an invalid threshold
(since the slope is smaller than 60° and the $\omega(1)$ intercept lies outside $[0, 1]$). In other words, Fig. 3(c)
shows an invalid threshold since it violates the requirement that $\mu_\theta(\pi)$ is increasing on lines
towards $e_3$ on $\Pi(3)$. (A line segment $\mathcal{L}(e_3, \bar\pi)$ starting from some point $\bar\pi$ on facet $(e_2, e_1)$ and
connected to $e_3$ would start in the region $u = 2$ and then go to region $u = 1$. This violates the
requirement that $\mu_\theta(\pi)$ is increasing on lines towards $e_3$).
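To make the linear threshold policy and the constraints of Theorem 4 concrete, here is a minimal sketch. It assumes that the stopping condition in (32) reads $\pi(2) + \sum_{i=1}^{X-2} \theta(i)\,\pi(i+2) - \theta(X-1) < 0$ and that $X \ge 3$; the function names are hypothetical, and beliefs are stored with state 1 at index 0.

```python
import numpy as np

def linear_threshold_policy(pi, theta):
    """Return 1 (stop) or 2 (continue) under the assumed reading of (32)."""
    pi = np.asarray(pi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # pi(2) + sum_i theta(i) * pi(i + 2) - theta(X - 1)
    score = pi[1] + theta[:-1] @ pi[2:] - theta[-1]
    return 1 if score < 0 else 2

def satisfies_theorem4(theta):
    """Check the constraints of (33) on theta (requires X >= 3)."""
    theta = np.asarray(theta, dtype=float)
    top = theta[-2]    # theta(X - 2)
    rest = theta[:-2]  # theta(1), ..., theta(X - 3)
    return bool(top >= 1 and np.all(rest >= 0) and np.all(rest <= top)
                and theta[-1] > 0)
```

For $X = 3$ and $\theta = (1.5, 0.5)'$, the policy stops at $e_1$, continues at $e_2$ and $e_3$, and switches from 1 to 2 exactly once along any line from the facet $(e_2, e_1)$ to $e_3$.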

E. Algorithm to compute the optimal linear decision curve 

In this section a stochastic approximation algorithm is presented to estimate the optimal
threshold vector $\theta^*$ in (33). Because the cost $J_{\mu_\theta}(\pi_0)$ in (33) cannot be computed in closed
form, we resort to simulation based stochastic optimization. Let $n = 1, 2, \ldots$ denote iterations
of the algorithm. The aim is to solve the following linearly constrained stochastic optimization
problem:

$$\text{Compute } \theta^* = \arg\min_\theta \mathbb{E}\{J_n(\mu_\theta)\} \ \text{ subject to } 0 \le \theta(i) \le \theta(X-2), \ \theta(X-2) \ge 1 \text{ and } \theta(X-1) > 0. \tag{35}$$

Here, for each initial condition $\pi_0$, the sample path cost $J_n(\mu_\theta, \pi_0)$ is evaluated as

$$J_n(\mu_\theta, \pi_0) = \sum_{k=1}^{\infty} \rho^{k-1} C(\pi_k, u_k) \ \text{ where } u_k = \mu_\theta(\pi_k) \text{ is computed via (32)},$$
$$J_n(\mu_\theta) = \frac{1}{L} \sum_{l=1}^{L} J_n(\mu_\theta, \pi_0^{(l)}) \ \text{ where prior } \pi_0^{(l)} \text{ is sampled uniformly from simplex } \Pi(X). \tag{36}$$

A convenient way of sampling uniformly from $\Pi(X)$ is to use the Dirichlet distribution (i.e.,
$\pi_0(i) = x_i / \sum_i x_i$, where $x_i \sim$ unit exponential distribution).
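The sampling step can be sketched as follows: normalizing independent unit exponential variables gives a Dirichlet(1, ..., 1) sample, which is uniform on the simplex. The function name is hypothetical.

```python
import numpy as np

def sample_uniform_simplex(num_states, rng):
    """Draw a prior uniformly from the simplex by normalizing unit exponentials."""
    x = rng.exponential(scale=1.0, size=num_states)
    return x / x.sum()

rng_example = np.random.default_rng(0)
pi0_example = sample_uniform_simplex(3, rng_example)
```

An equivalent alternative is `rng.dirichlet(np.ones(num_states))`, which draws from the same distribution.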

The above constrained stochastic optimization problem can be solved by a variety of methods.
One method is to convert it into an equivalent unconstrained problem via the following
parametrization: Let $\phi = (\phi(1), \ldots, \phi(X-1))' \in \mathbb{R}^{X-1}$ and parametrize $\theta$ as

$$\theta^\phi = \left(\theta^\phi(1), \ldots, \theta^\phi(X-1)\right)' \ \text{ where } \ \theta^\phi(i) = \begin{cases} \phi^2(X-1) & i = X-1 \\ 1 + \phi^2(X-2) & i = X-2 \\ \left(1 + \phi^2(X-2)\right) \sin^2(\phi(i)) & i = 1, \ldots, X-3 \end{cases} \tag{37}$$

Then $\theta^\phi$ trivially satisfies constraints in (35). So (35) is equivalent to the following unconstrained
stochastic optimization problem:

$$\text{Compute } \mu_{\phi^*}(\pi) \ \text{ where } \phi^* = \arg\min_\phi \mathbb{E}\{J_n(\phi)\} \ \text{ and}$$
$$J_n(\phi) \text{ is computed using (36) with policy } \mu_{\theta^\phi}(\pi) \text{ evaluated according to (37).} \tag{38}$$
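The change of variables in (37) can be sketched as a small function that maps an unconstrained vector $\phi$ to a threshold vector $\theta^\phi$. The function name is hypothetical and the sketch assumes $X \ge 3$, so that $\phi$ has at least two entries.

```python
import numpy as np

def theta_from_phi(phi):
    """Map unconstrained phi in R^(X-1) to theta as in (37)."""
    phi = np.asarray(phi, dtype=float)
    top = 1.0 + phi[-2] ** 2                   # theta(X-2) >= 1
    theta = np.empty_like(phi)
    theta[:-2] = top * np.sin(phi[:-2]) ** 2   # theta(i) in [0, theta(X-2)]
    theta[-2] = top
    theta[-1] = phi[-1] ** 2                   # theta(X-1) >= 0
    return theta
```

Any real vector $\phi$ therefore yields coefficients with $0 \le \theta(i) \le \theta(X-2)$ and $\theta(X-2) \ge 1$, so a gradient step on $\phi$ never leaves the feasible set.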

Algorithm 1 below uses the Simultaneous Perturbation Stochastic Approximation (SPSA)
algorithm [47] to generate a sequence of estimates $\phi_n$, $n = 1, 2, \ldots$, that converges to a local
minimum of the optimal linear threshold $\phi^*$ with policy $\mu_{\phi^*}(\pi)$.
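A generic SPSA iteration can be sketched as below. This is the textbook two-sided simultaneous perturbation estimate with random $\pm 1$ directions, not necessarily the exact estimator used in Algorithm 1; the function name, the default constants and the `cost` callable (standing in for the simulated batch cost $J_n(\phi)$) are assumptions made for illustration.

```python
import numpy as np

def spsa_minimize(cost, phi0, n_iter, rng,
                  eps=0.5, s=10.0, zeta=0.6, delta=0.2, gamma=0.6):
    """Textbook SPSA sketch: two cost evaluations per iteration."""
    phi = np.asarray(phi0, dtype=float).copy()
    for n in range(n_iter):
        step = eps / (n + 1 + s) ** zeta   # decreasing step size
        pert = delta / (n + 1) ** gamma    # decreasing perturbation size
        omega = rng.choice([-1.0, 1.0], size=phi.shape)  # random direction
        diff = cost(phi + pert * omega) - cost(phi - pert * omega)
        grad = diff / (2.0 * pert * omega)  # elementwise gradient estimate
        phi = phi - step * grad
    return phi
```

Only two cost evaluations are needed per iteration regardless of the dimension of $\phi$. In the setting of this section, `cost` would average simulated sample path costs over sampled priors, and several initial values `phi0` would be tried since only local convergence is expected.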

Algorithm 1 Policy Gradient Algorithm for computing optimal linear threshold policy

Assume (A1-Ex1), (A2), (A3), (S-Ex1) hold so that the optimal policy is characterized
by a threshold switching curve in Theorem 1.

Step 1: Choose initial threshold coefficients $\phi_0$ and linear threshold policy $\mu_{\phi_0}$.

Step 2: For iterations $n = 0, 1, 2, \ldots$

• Evaluate sample cost $J_n(\phi_n)$ using (38). Compute gradient estimate $\hat\nabla_\phi J_n(\phi_n)$.
Here $\Delta_n = \Delta/(n+1)^\gamma$ denotes the gradient step size with $0.5 < \gamma < 1$ and $\Delta > 0$.

• Update threshold coefficients $\phi_n$ via stochastic approximation algorithm

$$\phi_{n+1} = \phi_n - \epsilon_{n+1} \hat\nabla_\phi J_n(\phi_n), \qquad \epsilon_n = \epsilon/(n + 1 + s)^\zeta, \quad 0.5 < \zeta < 1, \ \text{and } \epsilon, s > 0. \tag{39}$$

The above SPSA algorithm [47] picks a single random direction $\omega_n$ along which the
derivative is evaluated at each batch $n$. Unlike the Kiefer-Wolfowitz finite difference algorithm,
to evaluate the gradient estimate $\hat\nabla_\phi J_n$ in (39), SPSA requires only 2 batch simulations, i.e.,
the number of evaluations is independent of dimension of parameter $\phi$. Because the stochastic
gradient algorithm (39) converges to local optima, it is necessary to try several initial conditions
$\phi_0$. The computational cost at each iteration is linear in the dimension of $\theta$ and is independent
of the observation alphabet size.

For fixed $\theta$, the samples $J_n(\mu_\theta)$ in (36) are simulated independently and have identical
distribution. Thus the proof that $\theta_n = \theta^{\phi_n}$ generated by Algorithm 1 converges to a local optimum
of $\mathbb{E}\{J_n(\mu_\theta)\}$ (defined in (35)) with probability one is a straightforward application of techniques
in [26] (which gives general convergence methods for Markovian dependencies).

Remark: More sophisticated gradient estimation methods can be used instead of the SPSA
finite difference algorithm given here. For example, [23], [35] present score function and weak
derivative approaches for estimating the gradient of a Markov process with respect to a policy. In
[4] the score function method is used to perform gradient-based reinforcement learning. These
algorithms are applicable to solve the constrained stochastic optimization problem (35), thereby
yielding the optimal linear threshold policy. If the change time distribution (specified by $P$) and
the observation likelihoods (specified by $B$) are not completely specified, but (A2) and (A3)
hold, Theorem 1 applies and reinforcement learning algorithms [4] can be used to solve (35).
Moreover [25], [54] analyze the tracking properties of stochastic approximation algorithms when
the transition and observation matrices are time varying.

IV. Example 2: Quickest Transient Detection with variance penalty

Our second example deals with Bayesian quickest transient detection.$^6$ We show below that
under similar assumptions to quickest time detection, the threshold switching curve of Theorem 1
holds. Therefore the linear threshold results and Algorithm 1 hold.

The set up is identical to Sec. II with state space $X = \{1, 2, 3\}$. The transition probability
matrix and initial distribution are

$$P = \begin{bmatrix} 1 & 0 & 0 \\ P_{21} & P_{22} & 0 \\ 0 & P_{32} & P_{33} \end{bmatrix}, \qquad \pi_0 = e_3. \tag{40}$$

So the Markov chain starts in state 3. After some geometrically distributed time it jumps to the
transient state 2. Finally, after residing in state 2 for some geometrically distributed time, it
jumps to the absorbing state 1.

In quickest transient detection, we are interested in detecting transition to state 2 with minimum
cost. The action space is $\mathcal{U} = \{1 \text{ (stop)}, 2 \text{ (continue)}\}$. The stop action $u = 1$ declares that
transient state 2 was visited.

We choose the following costs (see [39] for other choices). Similar to (18), let $d_i I(x_k =
e_i, u_k = 2)$ denote the delay cost in state $e_i$, $i \in X$. Of course $d_3 = 0$ since $x_k = e_3$ implies that
the transient state has not yet been visited. So the expected delay cost is

$$\sum_{i \in X} d_i\, \mathbb{E}\{I(x_k = e_i, u_k = 2) \mid \mathcal{F}_k\} = d'\pi_k \ \text{ where } d = (d_1, d_2, d_3)', \ d_3 = 0. \tag{41}$$

Typically the elements of the delay vector $d$ are chosen as $d_1 \ge d_2 > 0$ so that state 1 (final
state) accrues a larger delay than the transient state. This gives incentive to declare that transient
state 2 was visited when the current state is 2, rather than wait until the process reaches state 1.

The false alarm cost for declaring $u = 1$ (transient state 2 was visited) when $x = 1$ is zero
since the final state 1 could only have been reached after visiting transient state 2. So the false
alarm penalty is $\mathbb{E}\{I(x_k = e_3, u_k = 1) \mid \mathcal{F}_k\} = 1 - (e_1 + e_2)'\pi_k$ for action $u_k = 1$. For convenience,
in the variance penalty (14), we choose $g = [0, 0, 1]'$. So from (16), (18), the expected stopping
cost and continuing costs are

$$C(\pi, 1) = \alpha\left(g'\pi - (g'\pi)^2\right) + \beta\left(1 - (e_1 + e_2)'\pi\right), \qquad C(\pi, 2) = d'\pi. \tag{42}$$

The optimal decision policy $\mu^*(\pi)$ and stopping set $\mathcal{R}_1$ are as in (24), (25) and (26).

$^6$ The author gratefully acknowledges Dr. Venugopal Veeravalli at U. Illinois for describing quickest transient detection and
giving access to the preprint [39].

Main Result. The following assumptions are similar to those in quickest time detection. Note 
that due to its structure, P in (40) is always TP2 (i.e., (A3) in Sec.III-A holds). 
(S-Ex2) The scaling factor for the variance penalty satisfies a < rf2 ^^ 3 - 

For $\alpha = 0$ (zero variance penalty), (S-Ex2) holds trivially.

Theorem 5: Consider the quickest transient detection problem with delay and stopping costs in
(42). Then under (A2), (40), (S-Ex2), the conclusions of Theorem 1 hold. Also if the observation
likelihoods are non-zero, then for $k \ge 2$, $\pi_k \in \Pi^\circ(X)$ and the threshold is non-degenerate, see
Sec.III-B3 and Lemma 1. Thus Algorithm 1 estimates the optimal linear threshold. (The proof
follows from meta Theorem 13 in Appendix B and Theorem 1). ■

PH-distributed change times: More generally, suppose the process $x$ jumps after a PH-distributed
time to a transient state. Then after another PH-distributed time period, it jumps to
the absorbing state. We show that Theorem 5 continues to hold.

To model the two PH-distributed change times, let 1 denote the absorbing state, $\mathbb{T} = \{2, \ldots, X_1 +
1\}$ denote the set of transient states and $\mathbb{S} = \{X_1 + 2, \ldots, X_1 + X_2 + 1\}$ denote the set of starting
states. Define the $(X_1 + X_2 + 1) \times (X_1 + X_2 + 1)$ transition matrix



P 



Ex 1 xX 1 Xl xX 2 

Ox, P 



(43) 



>X 2 ^{X 1 +X 2 -l)x.X 2 

Suppose the Markov chain starts in §. Then after a PH-distributed time it jumps to T and finally 
after another PH-distributed time, jumps to state 1. Just as in Sec.II, P and P determine the 
PH-distribution in the start and transient states, respectively. Let $d$ and $f$ denote the delay and
false alarm vectors. $d$ is a vector with decreasing elements with $d_i = 0$, $i \in \mathbb{S}$. $f$ is a vector
with increasing elements with $f_i = 0$, $i \notin \mathbb{S}$. The delay and stopping costs are $C(\pi, 1) = \beta f'\pi$,
$C(\pi, 2) = d'\pi$. Then the conclusions of Theorem 5 hold under (A2), (A3) if the decision maker
designs $f$ and $d$ to satisfy the following linear constraints:

$$f_{X_1+2} \ge 1, \qquad \left(d + \beta(\rho P - I) f\right)'(e_i - e_{i+1}) \ge 0, \quad i = 1, \ldots, X_1 + X_2. \tag{44}$$

The decision maker can design suitable $f$ and $d$ satisfying (44) using a linear programming
solver.

Remark: In the formulation of [39], it is assumed that states 1 and 3 are indistinguishable in
terms of observations, i.e., $B_{1y} = B_{3y}$ for all $y \in Y$. In this case, obviously (A2) does not hold.
At this stage, we are unable to prove the structural result of Theorem 5 when (A2) does not
hold. (Relabelling state 1 as 2 and state 2 as 1 does not work. Then $B$ satisfies (A2) but the
transition matrix $P$ is no longer TP2). Nevertheless, we have the following result which follows
from Theorem 3.

Corollary 1: For $\alpha = 0$ (no variance penalty), the stopping region $\mathcal{R}_1$ in quickest
transient detection is a convex subset of $\Pi(X)$. ■

V. Example 3: Quickest Detection with Exponential Penalty for Delay 

In this example, we generalize the results of Poor [36], which deals with exponential delay 
penalty and geometric change times. We consider exponential delay penalty with PH-distributed 
change time. Our formulation involves risk sensitive partially observed stochastic control, see 
Sec. I for motivation. We first show that the exponential penalty cost function in [36] is a special 
case of risk-sensitive stochastic control cost function when the state space dimension X = 2. 
We then use the risk-sensitive stochastic control formulation to derive structural results for
PH-distributed change time. In particular, the main result below (Theorem 6) shows that the threshold
switching curve still characterizes the optimal stopping region $\mathcal{R}_1$. The assumptions and main
results are conceptually similar to Theorem 1.

Since our aim is to interpret and extend the results of [36] using risk sensitive control, we
consider the same costs as in [36], so $\alpha = 0$ (no variance penalty). Below, we will use $c(e_i, u = 1)$
to denote false alarm costs and $c(e_i, u = 2)$ to denote delay costs, where $i \in X$.

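Risk sensitive control [6] considers the exponential cost function

\[ J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ \exp\Big( \epsilon \sum_{k=1}^{\tau-1} c(x_k, u_k = 2) + \epsilon\, c(x_\tau, u_\tau = 1) \Big) \Big\} \tag{45} \]

where $\epsilon > 0$ is the risk sensitive parameter.

Let us first show that the exponential penalty cost in [36] is a special case of (45) for
the case X = 2 (geometric distributed change time). For the state $x \in \{e_1, e_2\}$, choose
$c(x, u = 1) = \beta I(x \ne e_1, u = 1) = \beta(1 - e_1'x)$ (false alarm cost) and
$c(x, u = 2) = d\, I(x = e_1, u = 2) = d\, e_1'x$ (delay cost). Then it is easily seen that
$\sum_{k=1}^{\tau-1} c(x_k, u_k = 2) + c(x_\tau, u_\tau = 1) = d\,|\tau - \tau^0|^+ + \beta I(\tau < \tau^0)$.
Therefore (recall $\tau^0$ is defined in (6) and $\tau$ is defined in (19)),

\begin{align}
J_\mu(\pi_0) &= E^\mu_{\pi_0}\big\{ \exp\big(\epsilon d\,|\tau - \tau^0|^+ + \epsilon\beta I(\tau < \tau^0)\big)\, [I(\tau < \tau^0) + I(\tau = \tau^0) + I(\tau > \tau^0)] \big\} \nonumber \\
&= E^\mu_{\pi_0}\big\{ \exp(\epsilon\beta) I(\tau < \tau^0) + \exp(\epsilon d\,|\tau - \tau^0|^+) I(\tau > \tau^0) + I(\tau = \tau^0) \big\} \nonumber \\
&= E^\mu_{\pi_0}\big\{ (e^{\epsilon\beta} - 1) I(\tau < \tau^0) + e^{\epsilon d |\tau - \tau^0|^+} \big\} \nonumber \\
&= (e^{\epsilon\beta} - 1) P^\mu_{\pi_0}(\tau < \tau^0) + E^\mu_{\pi_0}\big\{ e^{\epsilon d |\tau - \tau^0|^+} \big\} \tag{46}
\end{align}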

which is identical to Poor's exponential delay cost function [36, Eq.40]. Thus the Bayesian 
quickest time detection with exponential delay penalty in [36] is a special case of a risk sensitive 
stochastic control problem. 
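The two identities used above for X = 2 can be checked path by path. The sketch below is only an illustration: it assumes that $\tau^0$ is the first time the state equals $e_1$ (so $x_k = e_2$ for $k < \tau^0$), and the function names and numerical values are hypothetical, not taken from the paper.

```python
import math


def stagewise_cost(tau, tau0, d, beta):
    """Accumulated cost along one sample path for X = 2.

    Assumption: x_k = e_2 for k < tau0 and x_k = e_1 for k >= tau0;
    the decision is 'continue' for k = 1..tau-1 and 'stop' at k = tau.
    """
    # c(x_k, u_k = 2) = d * I(x_k = e_1), summed over k = 1..tau-1
    delay = sum(d for k in range(1, tau) if k >= tau0)
    # c(x_tau, u_tau = 1) = beta * I(x_tau != e_1)
    false_alarm = beta if tau < tau0 else 0.0
    return delay + false_alarm


def closed_form(tau, tau0, d, beta):
    """d |tau - tau0|^+ + beta I(tau < tau0)."""
    return d * max(tau - tau0, 0) + (beta if tau < tau0 else 0.0)


def identities_hold(max_t=8, d=0.7, beta=2.0, eps=0.5):
    """Check the additive identity and the exponential identity behind (46)."""
    for tau in range(1, max_t + 1):
        for tau0 in range(1, max_t + 1):
            total = stagewise_cost(tau, tau0, d, beta)
            if not math.isclose(total, closed_form(tau, tau0, d, beta), abs_tol=1e-12):
                return False
            lhs = math.exp(eps * total)
            rhs = ((math.exp(eps * beta) - 1.0) * (1.0 if tau < tau0 else 0.0)
                   + math.exp(eps * d * max(tau - tau0, 0)))
            if not math.isclose(lhs, rhs, rel_tol=1e-9):
                return False
    return True
```

Because both identities hold on every path, taking expectations gives (46) regardless of the change time distribution.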

We consider the delay cost as in (17); so for state $x \in \{e_1, \ldots, e_X\}$, $c(x, u_k = 2) = d\, e_1' P' x$.
To get an intuitive feel for this modified delay cost function, for the case X = 2,

\[ \sum_{k=1}^{\tau-1} c(x_k, u_k = 2) + c(x_\tau, u_\tau = 1) = d\,|\tau - \tau^0|^+ + \beta I(\tau < \tau^0) + d P_{21} (\tau^0 - 1) I(\tau^0 < \tau). \]

Therefore, by using (21), for X = 2, the exponential delay cost function is

\[ J_\mu(\pi_0) = (e^{\epsilon\beta} - 1) P^\mu_{\pi_0}(\tau < \tau^0) + E^\mu_{\pi_0}\Big\{ e^{\epsilon d \left[ |\tau - \tau^0|^+ + P_{21}(\tau^0 - 1) I(\tau^0 < \tau) \right]} \Big\}. \tag{47} \]

This is similar to (46) except for the additional term $P_{21}(\tau^0 - 1) I(\tau^0 < \tau)$ in the exponential.

With the above motivation, in the rest of this section we consider risk sensitive quickest time
detection for PH-distributed change time, i.e. $X > 2$. Let $\pi$ denote the risk sensitive belief state,
see [14], [17] for extensive descriptions of the risk sensitive belief state and verification theorems
for dynamic programming in risk sensitive control. It can be shown [14] that the value function
$V(\pi)$ satisfies

\[ V(\pi) = \min\Big\{ C(\pi, 1),\; \sum_{y \in Y} V(T(\pi, y))\, \sigma(\pi, y) \Big\} \tag{48} \]

where, with $R_1 = (1, e^{\epsilon\beta}, \ldots, e^{\epsilon\beta})'$, $R_2 = (e^{\epsilon d}, e^{\epsilon d P_{21}}, \ldots, e^{\epsilon d P_{X1}})'$ and $B_y$ defined in (11),

\[ C(\pi, 1) = R_1'\pi, \qquad T(\pi, y) = \frac{B_y P' \,\mathrm{diag}(R_2)\, \pi}{\sigma(\pi, y)}, \qquad \sigma(\pi, y) = \mathbf{1}' B_y P' \,\mathrm{diag}(R_2)\, \pi. \tag{49} \]

As in Sec. II-B, define $\bar{V}(\pi) = V(\pi) - C(\pi, 1)$. Then $\bar{V}(\pi)$ satisfies Bellman's equation (25)
with

\[ \bar{C}(\pi, 1) = 0, \qquad \bar{C}(\pi, 2) = R_1' (P' \,\mathrm{diag}(R_2) - I)\, \pi. \tag{50} \]

Assume the following condition holds:
(A1-Ex3) The elements of $R_1'(P' \,\mathrm{diag}(R_2) - I)$ are decreasing with respect to $i = 1, 2, \ldots, X$.
Evaluating $\bar{C}(\pi, 2) = R_1'(P' \,\mathrm{diag}(R_2) - I)\pi$, (A1-Ex3) is equivalent to

\[ e^{\epsilon d} - 1 > e^{\epsilon d P_{21}} \big( P_{21} + e^{\epsilon\beta}(1 - P_{21}) \big) - e^{\epsilon\beta} \quad \text{and} \quad e^{\epsilon d P_{i1}} \big( P_{i1} + e^{\epsilon\beta}(1 - P_{i1}) \big) \text{ decreasing in } i \in \{2, \ldots, X\}. \]

For example, if $d = \epsilon = 1$, then for $\beta > 1$, the following are verified by elementary calculus:

(i) (A1-Ex3) always holds for $\beta > 1$ when X = 2 (geometric distributed change time).

(ii) For PH-distributed change time, if (A3) holds, then (A1-Ex3) always holds providing $P_{21} < 1/(e^{\beta} - 1)$.
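Condition (A1-Ex3) is easy to test numerically for a given model. The following is a minimal sketch, assuming state 1 is absorbing (so $P_{11} = 1$) and using hypothetical function names and example matrices; it evaluates the row vector $R_1'(P'\,\mathrm{diag}(R_2) - I)$ element by element and checks that it is decreasing.

```python
import math


def a1_ex3_vector(P, d, beta, eps):
    """Elements of R_1'(P' diag(R_2) - I), indexed by state i (0-based)."""
    X = len(P)
    R1 = [1.0] + [math.exp(eps * beta)] * (X - 1)
    # R_2(i) = exp(eps * d * P_{i1}); equals exp(eps * d) for the absorbing state
    R2 = [math.exp(eps * d * P[i][0]) for i in range(X)]
    return [R2[i] * sum(R1[j] * P[i][j] for j in range(X)) - R1[i] for i in range(X)]


def a1_ex3_holds(P, d, beta, eps, tol=1e-12):
    """True if the elements above are decreasing in i."""
    v = a1_ex3_vector(P, d, beta, eps)
    return all(v[i] >= v[i + 1] - tol for i in range(len(v) - 1))


# Hypothetical geometric (X = 2) and 3-state phase-type examples.
P_geometric = [[1.0, 0.0], [0.3, 0.7]]
P_phase = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.0], [0.05, 0.15, 0.8]]
```

With $d = \epsilon = 1$ and $\beta = 2$, both example matrices pass the check; `P_phase` has $P_{21} = 0.1$, which is below $1/(e^{2} - 1) \approx 0.157$.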

Theorem 6: The stopping region $\mathcal{R}_1$ is a convex subset of $\Pi(X)$. Under (A1-Ex3), (A2), (A3),
Theorem 1 holds. Thus Algorithm 1 estimates the optimal linear threshold. ■

The proof is in Appendix F.
Remarks: (i) Delay Formulation in [36]: Consider the formulation in Poor [36] which is equivalent
to (46). Then for the geometric distributed case X = 2, the convexity of $\mathcal{R}_1$ holds using
a similar proof to above. Since $\Pi(X)$ is a 1-dimensional simplex and $e_1 \in \mathcal{R}_1$, convexity
implies there exists a (possibly degenerate) threshold point $\pi^*$ that characterizes $\mathcal{R}_1$ such that
the optimal policy is of the form (1). As a sanity check, the analogous condition to (A1-Ex3)
reads $e^{\epsilon d} - 1 > P_{21}(1 - e^{\epsilon\beta})$. This always holds for $\epsilon > 0$. Therefore, assuming (A2) holds, the
above theorem holds for Poor's [36] exponential delay penalty case. (Recall (A3)
holds trivially when X = 2.) Finally, for $X > 2$, using a similar proof (see Theorem 2), one
can again show that the conclusions of Theorem 6 hold.

(ii) Other Examples: With the above risk sensitive formulation, the dynamic programming 
equation (48) for the exponential delay case is very similar to the other examples in this paper. 
Therefore, it is straightforward to generalize the above exponential penalty result to quickest 
transient detection (of Sec. IV), and social learning stopping time problems considered below. 

VI. Examples 4 & 5: Stopping Time Problems in Multi-agent Social Learning

Here we consider stopping time problems in multi-agent social learning. We present two 
results: 

(i) Sec. VI-B (Example 4) considers a multi-agent system seeking to solve a Bayesian stopping
time problem.

(ii) Sec. VI-C (Example 5) deals with constrained optimal social learning which is formulated
in Chamley [11] as a sequential stopping time problem. We show that the optimal policy has a
threshold switching curve similar to Theorem 1.

A. Motivation: Social Learning amongst myopic agents 

Since social learning only serves as a motivation for subsequent subsections, our description
is brief; see [11]. Consider a countably infinite number of agents performing social learning to
estimate an underlying random state x. Each agent acts once in a predetermined sequential order
indexed by k = 1, 2, .... One can also view k as the discrete time instant when agent k acts.
A key difference between social learning and the formulation in previous sections is
that agent k does not have access to the belief state or private observations of previous agents.
Instead each agent k only has access to the actions taken by previous agents together with its
own current private observation $y_k$.

Throughout this section, we assume that the observation space Y is finite. Let $y_k \in Y = \{1, 2, \ldots, Y\}$
denote the private observation of agent k and $a_k \in A = \{1, 2, \ldots, A\}$ denote the
action agent k takes. Define the sigma algebras:

\[ \mathcal{H}_k:\ \sigma\text{-algebra generated by } (a_1, \ldots, a_{k-1}, y_k), \qquad \mathcal{G}_k:\ \sigma\text{-algebra generated by } (a_1, \ldots, a_{k-1}, a_k). \tag{51} \]

The social learning model [8], [11] comprises the following ingredients:

(i) The state of nature x as in Sec. II-A except that the transition matrix is P = I. That is, the
state of nature is a random variable with distribution $\pi_0$ (see (4)) instead of a random process.

(ii) At time k, agent k records a private observation $y_k \in Y$ from the observation distribution
$B_{iy} = P(y | x = e_i)$, $i \in X$.

(iii) Private belief: Using the public belief $\pi_{k-1}$ available at time k − 1 (defined in Step (v)
below), agent k then updates its Bayesian private belief $\eta_k$ as in (11) with P = I. Here

\[ \eta_k = E\{x | \mathcal{H}_k\} = (\eta_k(i),\ i \in X), \qquad \eta_k(i) = P(x = e_i | a_1, \ldots, a_{k-1}, y_k), \quad \text{initialized with } \pi_0. \tag{52} \]

(iv) Myopic Action: Agent k then takes action $a_k \in A = \{1, 2, \ldots, A\}$ to minimize myopically
its expected cost, $a_k = \arg\min_{a \in A} \{c_a' \eta_k\}$. Here $c_a = (c(e_i, a),\ i \in X)$ denotes an X dimensional
cost vector, and $c(e_i, a)$ denotes the cost incurred when the underlying state is $e_i$ and the agent
picks action a. Thus agent k chooses action

\[ a_k = a(\pi_{k-1}, y_k) = \arg\min_{a \in A} E\{c(x, a) | \mathcal{H}_k\} = \arg\min_{a \in A} \{c_a' \eta_k\}. \tag{53} \]

(v) Social learning and Public belief: Finally agent k broadcasts this action $a_k$ to subsequent
agents. Define the public belief $\pi_k$ as the posterior distribution of the state x given all actions
taken up to time k:

\[ \pi_k = E\{x | \mathcal{G}_k\} = (\pi_k(i),\ i \in X), \qquad \pi_k(i) = P(x = e_i | a_1, \ldots, a_k), \quad \text{initialized with } \pi_0. \tag{54} \]

Based on the action $a_k$, all agents (apart from k) perform social learning to update their public
belief according to the following "social learning Bayesian filter":

\[ \pi_k = T(\pi_{k-1}, a_k), \quad \text{where } T(\pi, a) = \frac{R^\pi_a \pi}{\sigma(\pi, a)}, \quad \sigma(\pi, a) = \mathbf{1}_X' R^\pi_a \pi. \tag{55} \]

In (55), $R^\pi_a = \mathrm{diag}(P(a | x = e_i, \pi),\ i \in X)$ with elements

\begin{align}
P(a_k = a | x = e_i, \pi_{k-1} = \pi) &= \sum_{y \in Y} P(a_k = a | y, \pi)\, P(y | x = e_i) \tag{56} \\
&= \sum_{y \in Y} \prod_{\tilde{a} \in A - \{a\}} I(c_a' B_y \pi < c_{\tilde{a}}' B_y \pi)\, P(y | x = e_i) \nonumber
\end{align}

where $I(\cdot)$ denotes the indicator function and $B_y$ is defined in (11).

The following well known result [8], [11] states that eventually, after some finite time $\bar{k}$, all
agents pick the same action and the public belief freezes. This is termed an information cascade.
The proof follows via an elementary application of the martingale convergence theorem.

Theorem 7 ([8]): The above social learning model leads to an information cascade (i.e., all
agents herd) in finite time with probability 1. That is, there exists a finite time $\bar{k}$ after which
social learning ceases, i.e., public belief $\pi_{k+1} = \pi_k$, $k \ge \bar{k}$, and all agents pick the same action,
i.e., $a_{k+1} = a_k$, $k \ge \bar{k}$. ■
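The social learning protocol of steps (iii) to (v) can be sketched in a few lines. This is only an illustrative sketch: indices are 0-based, ties in the myopic action are broken in favour of the lowest action index (an assumption; (56) uses strict inequalities), and the observation matrix and cost values are hypothetical numbers chosen to satisfy the ordering of costs used later in (57).

```python
def myopic_action(pi, y, B, c):
    """Action minimizing c_a' B_y pi, i.e. the expected cost under the private belief."""
    n_actions = len(c[0])
    costs = [sum(c[i][a] * B[i][y] * pi[i] for i in range(len(pi))) for a in range(n_actions)]
    return min(range(n_actions), key=costs.__getitem__)


def social_learning_filter(pi, a, B, c):
    """Public belief update (55): returns (T(pi, a), sigma(pi, a))."""
    n_obs = len(B[0])
    # Diagonal of R^pi_a: P(a | x = e_i, pi) = sum over y that lead to action a of B[i][y]
    R = [sum(B[i][y] for y in range(n_obs) if myopic_action(pi, y, B, c) == a)
         for i in range(len(pi))]
    unnormalized = [R[i] * pi[i] for i in range(len(pi))]
    sigma = sum(unnormalized)
    if sigma == 0.0:
        return None, 0.0  # action a has zero probability under this public belief
    return [v / sigma for v in unnormalized], sigma


B_example = [[0.9, 0.1], [0.1, 0.9]]   # B[i][y] = P(y | x = e_i), hypothetical
c_example = [[1.0, 2.0], [2.0, 1.0]]   # c[i][a] = c(e_i, a), hypothetical
```

For a public belief such as [0.05, 0.95] the chosen action is the same for every observation, so the update returns the belief unchanged; this is the information cascade of Theorem 7.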

B. Example 4: Sequential Detection with Social Learning 

Suppose a multi-agent system makes local decisions and performs social learning as above.
Given such a protocol, how can the multi-agent system make a global decision when to stop?
As mentioned in Sec. I, such problems are motivated in decision systems where a global decision
needs to be made based on local decisions of agents.

We consider a Bayesian sequential detection problem for state $x = e_1$. The main result below
(Theorem 8) is that the global decision of when to stop is a multi-threshold function of the belief
state. This unusual behavior is because in social learning, the action likelihood probabilities
in (56) depend on the belief state $\pi$.

Consider X = Y = {1, 2} and A = {1, 2} and the social learning model of Sec. VI-A, where
the costs $c(e_i, a)$ satisfy

\[ c(e_1, 1) < c(e_1, 2), \qquad c(e_2, 2) < c(e_2, 1). \tag{57} \]

Otherwise one action will always dominate the other action and the problem is uninteresting.
Redefine the sigma algebras in (51) to include the action history:

\[ \mathcal{H}_k:\ \sigma\text{-algebra generated by } (a_1, \ldots, a_{k-1}, y_k, u_1, \ldots, u_k), \qquad \mathcal{G}_k:\ \sigma\text{-algebra generated by } (a_1, \ldots, a_{k-1}, a_k, u_1, \ldots, u_k). \tag{58} \]

Let $\tau$ denote a stopping time adapted to the sequence of sigma-algebras $\mathcal{G}_k$, $k > 1$ (see (58)).
In words, each agent has only the public belief obtained via social learning to make the global
decision of whether to continue or stop. The goal is to solve the following sequential detection
problem to detect state $e_1$. Pick the stopping time $\tau$ to minimize

\[ J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ \sum_{k=1}^{\tau-1} \rho^{k-1} d\, E\{I(x = e_1) | \mathcal{G}_{k-1}\} + \rho^{\tau-1} \beta\, E\{I(x \ne e_1) | \mathcal{G}_{\tau-1}\} \Big\}. \tag{59} \]

As in previous sections, the first term is the delay cost and penalizes the decision of choosing
$u_k = 2$ (continue) when the state is $e_1$ by the non-negative constant d. The second term is the
stopping cost incurred by choosing $u_\tau = 1$ (stop and declare state 1) at time $k = \tau$. It is the
error probability of declaring state $e_1$ when the actual state is $e_2$. In terms of the public belief,

\[ J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ \sum_{k=1}^{\tau-1} \rho^{k-1} C(\pi_{k-1}, u_k = 2) + \rho^{\tau-1} C(\pi_{\tau-1}, u_\tau = 1) \Big\}, \tag{60} \]
\[ C(\pi, 2) = d\, e_1'\pi, \qquad C(\pi, 1) = \beta\, e_2'\pi. \]

The global decision $u_k = \mu(\pi_{k-1}) \in \{1 \text{ (stop)}, 2 \text{ (continue)}\}$ is a function of the public belief
$\pi_{k-1}$ updated according to the social learning protocol (52), (55). The optimal policy $\mu^*(\pi)$ and
value function $\bar{V}(\pi)$ satisfy Bellman's equation (25) with

\[ Q(\pi, 2) = \bar{C}(\pi, 2) + \rho \sum_{a \in A} \bar{V}(T(\pi, a))\, \sigma(\pi, a), \quad \text{where} \tag{61} \]
\[ \bar{C}(\pi, 2) = C(\pi, 2) - (1 - \rho) C(\pi, 1), \qquad Q(\pi, 1) = \bar{C}(\pi, 1) = 0. \]

Here $T(\pi, a)$ and $\sigma(\pi, a)$ are obtained from the social learning Bayesian filter (55).

Since X = {1, 2}, the public belief state $\pi = [1 - \pi(2), \pi(2)]'$ is parametrized by the scalar
$\pi(2) \in [0, 1]$, i.e., $\Pi(X)$ is the interval [0, 1]. In order to state the main result, define the following
four intervals which form a partition of the interval [0, 1]:

\[ \mathcal{P}_l = \{\pi(2) : \eta_l < \pi(2) \le \eta_{l-1}\}, \quad l = 1, \ldots, 4, \quad \text{where} \]
\[ \eta_0 = 1, \qquad \eta_1 = \frac{(c(e_1, 2) - c(e_1, 1)) B_{11}}{(c(e_1, 2) - c(e_1, 1)) B_{11} + (c(e_2, 1) - c(e_2, 2)) B_{21}}, \]
\[ \eta_2 = \frac{c(e_1, 2) - c(e_1, 1)}{(c(e_1, 2) - c(e_1, 1)) + (c(e_2, 1) - c(e_2, 2))}, \]
\[ \eta_3 = \frac{(c(e_1, 2) - c(e_1, 1)) B_{12}}{(c(e_1, 2) - c(e_1, 1)) B_{12} + (c(e_2, 1) - c(e_2, 2)) B_{22}}, \qquad \eta_4 = 0. \]

Note that $\eta_0$ corresponds to belief state $e_2$, and $\eta_4$ corresponds to belief state $e_1$. (See discussion
at the end of this section for more intuition about the intervals $\mathcal{P}_l$.)
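The interval endpoints are simple functions of the costs and the observation matrix. The sketch below computes them for X = Y = A = {1, 2}; the function names are hypothetical, indices are 0-based, and the cost values are illustrative numbers satisfying (57) rather than the parameters used in the paper's figures.

```python
def cascade_thresholds(B, c):
    """Return [eta_0, eta_1, eta_2, eta_3, eta_4] for the two-state model."""
    delta1 = c[0][1] - c[0][0]   # c(e_1, 2) - c(e_1, 1), positive under (57)
    delta2 = c[1][0] - c[1][1]   # c(e_2, 1) - c(e_2, 2), positive under (57)
    eta1 = delta1 * B[0][0] / (delta1 * B[0][0] + delta2 * B[1][0])
    eta2 = delta1 / (delta1 + delta2)
    eta3 = delta1 * B[0][1] / (delta1 * B[0][1] + delta2 * B[1][1])
    return [1.0, eta1, eta2, eta3, 0.0]


def interval_index(pi2, eta):
    """Index l of the interval P_l containing pi(2); pi(2) = 0 is assigned to P_4."""
    for l in range(1, 5):
        if eta[l] < pi2 <= eta[l - 1]:
            return l
    return 4


B_example = [[0.9, 0.1], [0.1, 0.9]]   # symmetric TP2 observation matrix
c_example = [[1.0, 2.0], [2.0, 1.0]]   # hypothetical costs c[i][a] = c(e_i, a)
eta_example = cascade_thresholds(B_example, c_example)
```

For these illustrative values the thresholds are approximately 0.9, 0.5 and 0.1, so public beliefs with $\pi(2)$ above 0.9 or below 0.1 lie in the two outer intervals.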

It is readily verified that if the observation matrix B is TP2, then $\eta_3 < \eta_2 < \eta_1$. The following
is the main result.

Theorem 8: Consider the stopping time problem (60) where agents perform social learning
using the social learning Bayesian filter (55). Assume (57), $d > \rho\beta$, and B is TP2 symmetric.
Then the optimal stopping policy $\mu^*(\pi)$ has the following structure: The stopping set $\mathcal{R}_1$ is the
union of at most three intervals. That is, $\mathcal{R}_1 = \mathcal{R}_1^a \cup \mathcal{R}_1^b \cup \mathcal{R}_1^c$ where $\mathcal{R}_1^a$, $\mathcal{R}_1^b$, $\mathcal{R}_1^c$ are possibly
empty intervals. Here

(i) The stopping interval $\mathcal{R}_1^a \subseteq \mathcal{P}_1 \cup \mathcal{P}_4$ and is characterized by a threshold point. That is, if $\mathcal{P}_1$
has a threshold point $\pi_1^*$, then $\mu^*(\pi) = 1$ for all $\pi(2) \in \mathcal{P}_4$ and

\[ \mu^*(\pi) = \begin{cases} 2 & \text{if } \pi(2) \ge \pi_1^* \\ 1 & \text{otherwise.} \end{cases} \tag{62} \]

Similarly, if $\mathcal{P}_4$ has a threshold point $\pi_4^*$, then $\mu^*(\pi) = 2$ for all $\pi(2) \in \mathcal{P}_1$.

(ii) The stopping intervals $\mathcal{R}_1^b \subseteq \mathcal{P}_2$ and $\mathcal{R}_1^c \subseteq \mathcal{P}_3$.

(iii) The intervals $\mathcal{P}_1$ and $\mathcal{P}_4$ are regions of information cascades. That is, if $\pi_k \in \mathcal{P}_1 \cup \mathcal{P}_4$, then
social learning ceases and $\pi_{k+1} = \pi_k$ (see Theorem 7 for definition of information cascade). ■

Fig. 4. Double Threshold Policy in stopping time problem involving social learning. The parameters are specified in (63) and
(65). Fig. 4(a) and (c) show the optimal policy $\mu^*(\pi)$ and Fig. 4(b) and (d) show the value function $V(\pi)$ computed using (61).

The proof of Theorem 8 is in Appendix G. The proof depends on properties of the social 
learning filter and these are summarized in Lemma 7 in Appendix G. 

Examples: (i) To illustrate the multiple threshold structure of the above theorem, consider the
stopping time problem (60) with the following parameters:

\[ \rho = 0.9, \quad d = 1.8, \quad B = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}, \quad c(e_i, a) = \begin{bmatrix} 4.57 & 5.57 \\ 2.57 & 0 \end{bmatrix}, \quad \beta = 2. \tag{63} \]

Fig. 4(a) and (b) show the optimal policy and value function. These were computed by constructing
a grid of 500 values for $\Pi(X) = [0, 1]$. The double threshold behavior of the stopping
time problem when agents perform social learning is due to the discontinuous dynamics of the
Bayesian social learning filter (55).

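(ii) Consider the following generalization of the sequential detection problem (60). In addition
to the delay and error probability costs, we consider the total social learning cost incurred by
all the agents. So now instead of (59) we have

\begin{align}
J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ & \sum_{k=1}^{\tau-1} \rho^{k-1} E\big\{ \min_{a \in A} E\{c(x, a) | \mathcal{H}_k\} + d\, I(x = e_1) \,\big|\, \mathcal{G}_{k-1} \big\} \nonumber \\
& + \rho^{\tau-1} \beta\, E\{I(x \ne e_1) | \mathcal{G}_{\tau-1}\} + \rho^{\tau-1} E\big\{ \min_{a \in A} E\{c(x, a) | \mathcal{H}_\tau\} \,\big|\, \mathcal{G}_{\tau-1} \big\} \Big\}. \tag{64}
\end{align}

Here $\mathcal{G}_k$, $\mathcal{H}_k$ are defined in (58). The first and last terms above constitute the total social learning
cost (53) from time 1 to $\tau$. In terms of the public belief,

\[ E\big\{ \min_{a \in A} E\{c(x, a) | \mathcal{H}_k\} \,\big|\, \mathcal{G}_{k-1} \big\} = \sum_{y \in Y} \min_{a \in A} c_a' T(\pi_{k-1}, y)\, \sigma(\pi_{k-1}, y) = \sum_{y \in Y} \min_{a \in A} c_a' B_y \pi_{k-1}. \]

Then it can be shown that Theorem 8 holds providing $c(e_i, a)$ is decreasing in i. We chose the
following parameters (with the last term in (64) set to zero):

\[ \rho = 0.9, \quad d = 1, \quad B = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}, \quad c(e_i, a) = \begin{bmatrix} 2.1 & 3.1 \\ 3.1 & 0.53 \end{bmatrix}, \quad \beta = 20. \tag{65} \]

Fig. 4(c) and (d) show the optimal policy and value function. The optimal stopping policy is
again a double threshold, and the value function is monotone on individual intervals.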
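The grid computation mentioned in example (i) can be sketched as value iteration on Bellman's equation (61). The code below is a rough illustration, not the author's implementation: it assumes a uniform grid with linear interpolation of the value function, breaks ties in the local action in favour of action 1, and takes the entry $c(e_2, 2)$ of the cost matrix as 0 (treat that value as an assumption). Indices are 0-based.

```python
def myopic_action(pi2, y, B, c):
    """Local action (0-based) minimizing c_a' B_y pi; ties go to action 0 (assumption)."""
    pi = (1.0 - pi2, pi2)
    costs = [sum(c[i][a] * B[i][y] * pi[i] for i in range(2)) for a in range(2)]
    return 0 if costs[0] <= costs[1] else 1


def filter_step(pi2, a, B, c):
    """Social learning filter (55) for X = 2: returns (updated pi(2), sigma(pi, a))."""
    pi = (1.0 - pi2, pi2)
    R = [sum(B[i][y] for y in range(2) if myopic_action(pi2, y, B, c) == a) for i in range(2)]
    sigma = R[0] * pi[0] + R[1] * pi[1]
    if sigma <= 0.0:
        return pi2, 0.0
    return R[1] * pi[1] / sigma, sigma


def value_iteration(B, c, rho, d, beta, n=500, iters=400):
    """Iterate V(pi) = min{0, Cbar(pi, 2) + rho * sum_a V(T(pi, a)) sigma(pi, a)} on a grid."""
    grid = [j / (n - 1) for j in range(n)]
    # The filter does not depend on V, so precompute (sigma, lower index, weight) once.
    transitions = []
    for p in grid:
        row = []
        for a in range(2):
            nxt, sigma = filter_step(p, a, B, c)
            if sigma > 0.0:
                pos = nxt * (n - 1)
                lo = min(int(pos), n - 2)
                row.append((sigma, lo, pos - lo))
        transitions.append(row)

    def q_continue(j, V):
        p = grid[j]
        # Cbar(pi, 2) = d e_1' pi - (1 - rho) beta e_2' pi
        c_bar = d * (1.0 - p) - (1.0 - rho) * beta * p
        future = sum(s * ((1.0 - w) * V[lo] + w * V[lo + 1]) for s, lo, w in transitions[j])
        return c_bar + rho * future

    V = [0.0] * n
    for _ in range(iters):
        V = [min(0.0, q_continue(j, V)) for j in range(n)]
    policy = [1 if q_continue(j, V) >= 0.0 else 2 for j in range(n)]  # 1 = stop, 2 = continue
    return grid, V, policy


B_63 = [[0.9, 0.1], [0.1, 0.9]]
c_63 = [[4.57, 5.57], [2.57, 0.0]]   # last entry assumed
grid, V, policy = value_iteration(B_63, c_63, rho=0.9, d=1.8, beta=2.0)
```

Scanning `policy` along `grid` shows where the decision switches between stop and continue. Because the filter is discontinuous at the interval endpoints, the interpolated values near those points are only approximate.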

Discussion: The multiple threshold behavior (nonconvex stopping set $\mathcal{R}_1$) of Theorem 8 is
unusual. One would have thought that if it was optimal to 'continue' for a particular belief
$\pi^*(2)$, then it should be optimal to continue for all beliefs $\pi(2)$ larger than $\pi^*(2)$. The multiple
threshold optimal policy shows that this is not true. Fig. 4(a) shows that as the public belief $\pi(2)$
of state 2 decreases, the optimal decision switches from 'continue' to 'stop' to 'continue' and
finally 'stop'. Thus the global decision (stop or continue) is a non-monotone function of public
beliefs obtained from local decisions.

The main reason for this unusual behavior is the dependence of the action likelihood $R^\pi_a$ on
the belief state $\pi$. This causes the social learning Bayesian filter to have a discontinuous update.
The value function is no longer concave on $\Pi(X)$ and the optimal policy is not necessarily
monotone. As shown in the proof of Theorem 8, the value function $V(\pi)$ is concave on each of
the intervals $\mathcal{P}_l$, $l = 1, \ldots, 4$.

To explain the final claim of the theorem, let us define the intervals $\mathcal{P}_1$ and $\mathcal{P}_4$ more explicitly:

\begin{align}
\mathcal{P}_1 &= \{\pi : \arg\min_a c_a' B_y \pi = 2,\ \forall y \in Y\}, \tag{66} \\
\mathcal{P}_4 &= \{\pi : \arg\min_a c_a' B_y \pi = 1,\ \forall y \in Y\}. \nonumber
\end{align}

For public belief $\pi \in \mathcal{P}_1$, the optimal local action is a = 2 irrespective of the observation
y; similarly for $\pi \in \mathcal{P}_4$, the optimal local action is a = 1 irrespective of the observation y.
Therefore, on intervals $\mathcal{P}_1$ and $\mathcal{P}_4$, there is no social learning since the local action a reveals
nothing about the observation y to subsequent agents. Social learning only takes place when the
public belief is in $\mathcal{P}_2$ and $\mathcal{P}_3$.

Finally, we comment on the intervals $\mathcal{P}_l$, $l = 1, \ldots, 4$. They form a partition of $\Pi(X)$ such
that if $\pi \in \mathcal{P}_l$, then $T(\pi, 1) \in \mathcal{P}_{l+1}$ and $T(\pi, 2) \in \mathcal{P}_{l-1}$ (with obvious modifications for l = 1
and l = 4), see Lemma 7 in the appendix for details. In fact, $\eta_1$ and $\eta_3$ are fixed points of the
composition Bayesian maps: $\eta_1 = T(T(\eta_1, 1), 2)$ and $\eta_3 = T(T(\eta_3, 2), 1)$. Given that the updates
of the social Bayesian filter can be localized to specific intervals, we can then inductively prove
that the value function is concave on each such interval. This is the main idea behind Theorem 8.

C. Example 5: Constrained Social Optimum and Sequential Detection 

In this subsection, we consider the constrained social optimum formulation in Chamley [11, 
Chapter 4.5]. We show in Theorem 9 that the resulting stopping time problem has a threshold 
switching curve. Thus Chamley's optimal social learning can be implemented efficiently in a 
multi-agent system. This is in contrast to the multi-threshold behavior of the stopping time 
problem in Sec.VI-B when agents were selfish in choosing their local actions. 

The constrained social optimum formulation in [11] is motivated by the following question 7 : 
How can agents aid social learning by acting benevolently and choosing their action to sacrifice 
their local cost but optimize a social welfare cost? In Sec.VI-B, agents ignore the information 
benefit their action provides to others resulting in information cascades where social learning 
stops. By constraining the choice by which agents pick their local action to two specific decision 
rules, the optimal choice between the two rules becomes a sequential decision problem. 

7 The author gratefully acknowledges discussions with Dr C. Chamley who authored the book [11]. The constrained social
optimum formulation presented in this section is due to Chamley and is presented in [11]. Our formulation considers a multi-state
version together with sequential detection of state $e_1$.

Assume A = Y. As in the social learning model above, let $c_a$ denote the cost vector for
picking local action a and $\eta_k$ (see (52)) denote the private belief of agent k.

Let $\tau$ denote a stopping time adapted to the sequence of sigma-algebras $\mathcal{G}_k$, $k > 1$ defined in
(58). As in Sec. VI-B, the goal is to solve the following sequential detection problem to detect
state $e_1$: Pick the stopping time $\tau$ to minimize

\begin{align}
J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ & \sum_{k=1}^{\tau-1} \rho^{k-1} E\{c(x, a_k) | \mathcal{G}_{k-1}\} + \sum_{k=1}^{\tau-1} \rho^{k-1} d\, E\{I(x = e_1) | \mathcal{G}_{k-1}\} \nonumber \\
& + \rho^{\tau-1} \beta\, E\{I(x \ne e_1) | \mathcal{G}_{\tau-1}\} + \frac{\rho^{\tau-1}}{1 - \rho} \min_{a \in Y} E\{c(x, a) | \mathcal{G}_{\tau-1}\} \Big\}. \tag{67}
\end{align}

Similar to (64), the second and third terms are the delay cost and error probability in stopping
and announcing state $e_1$. The first and last terms model the total social welfare cost involving all
agents based on their local action. Let us explain these two terms. The key difference compared
to (64) is that agents now pick their local action according to the decision rule $a(\pi, y, \mu(\pi))$
(instead of myopically) as follows. As in [11], we constrain the decision rule $a(\pi, y, \mu(\pi))$ to two
possible modes:

\[ a(\pi_{k-1}, y_k, \mu(\pi_{k-1})) = \begin{cases} y_k & \text{if } u_k = \mu(\pi_{k-1}) = 2 \text{ (reveal observation)} \\ \arg\min_a c_a' \pi_{k-1} & \text{if } u_k = \mu(\pi_{k-1}) = 1 \text{ (stop).} \end{cases} \tag{68} \]

Here the stationary policy $\mu : \pi_{k-1} \to u_k$ specifies which one of the two modes the benevolent
agent k chooses. In mode $u_k = 2$, the agent k sacrifices its immediate cost $c(x, a_k)$ and picks
action $a_k = y_k$ to reveal full information to subsequent agents, thereby enhancing social learning.
In mode $u_k = 1$ the agent 'stops and announces state 1'. Equivalently, using the terminology
of [11], the agent "herds" in mode $u_k = 1$. It ignores its private observation $y_k$, and chooses
its action selfishly to minimize its cost given the public belief $\pi_{k-1}$. So agent k chooses
$a_k = \arg\min_a c_a' \pi_{k-1} = a_{k-1}$. Then clearly from (56), $P(a | e_i, \pi)$ is functionally independent of i, since
the action no longer depends on the observation y. Therefore from (55), if agent k herds, then
$\pi_k = T(\pi_{k-1}, a_k) = \pi_{k-1}$, i.e., the public belief remains frozen. The total cost incurred in herding is then equivalent
to the final term in (67):

\[ \sum_{k=\tau}^{\infty} \rho^{k-1} \min_{a \in Y} E\{c(x, a) | \mathcal{G}_{k-1}\} = \sum_{k=\tau}^{\infty} \rho^{k-1} \min_a c_a' \pi_{\tau-1} = \frac{\rho^{\tau-1}}{1 - \rho} \min_a c_a' \pi_{\tau-1}. \]
Define the constrained social optimal policy $\mu^*$ such that $J_{\mu^*}(\pi_0) = \inf_{\mu} J_\mu(\pi_0)$. The
sequential stopping problem (67) seeks to determine the optimal policy $\mu^*$ to achieve the optimal
tradeoff between stopping and announcing state 1 and the cost incurred by agents that are acting
benevolently. In analogy to Theorem 1, we show that $\mu^*(\pi)$ is characterized by a threshold curve.
Similar to (24) define the costs in terms of the belief state as

\[ C(\pi, 1) = \frac{1}{1 - \rho} \min_{a \in Y} c_a' \pi, \qquad C(\pi, 2) = \sum_{y \in Y} c_y' B_y \pi + (d + (1 - \rho)\beta)\, e_1' \pi - (1 - \rho)\beta. \tag{69} \]

Below we list the assumptions and main structural result which is similar to Theorem 1. These
assumptions involve the social learning cost $c(e_i, a)$ and are not required if these costs are zero.
(A1-Ex5) $c(e_i, a) - c(e_{i+1}, a) > 0$.
(S-Ex5) (i) $c(e_X, a) - c(e_i, a) > (1 - \rho) \sum_y \big( c(e_X, y) B_{Xy} - c(e_i, y) B_{iy} \big)$,
(ii) $(1 - \rho) \sum_y \big( c(e_1, y) B_{1y} - c(e_i, y) B_{iy} \big) > c(e_1, a) - c(e_i, a)$.

Similar to the discussion in Sec. III-C, (A1-Ex5) is sufficient for $C(\pi, 1)$ and $C(\pi, 2)$ to be
$\ge_r$ decreasing in $\pi \in \Pi(X)$. This implies that the costs $C(e_i, u)$ are decreasing in i, i.e., state
1 is the most costly state.

(S-Ex5) is sufficient for $C(\pi, u)$ to be submodular. It implies that $C(e_i, 2) - C(e_i, 1)$ is
decreasing in i. This gives economic incentive for agents to herd when approaching the state
$e_1$, since the differential cost between continuing and stopping is largest for $e_1$. Intuitively, the
decision to stop (herd) should be made when the state estimate is sufficiently accurate so that
revealing private observations is no longer required.

Theorem 9: Consider the sequential detection problem for state $e_1$ with social welfare cost in
(67) and constrained decision rule (68). Then:

(i) Under (A1-Ex5), (A2), (S-Ex5), the constrained social optimal policy $\mu^*(\pi)$ satisfies the structural
properties of Theorem 1. (Thus a threshold switching curve exists.)

(ii) The stopping set $\mathcal{R}_1$ is the union of |Y| convex sets (where |Y| denotes cardinality of Y).
Note also that $\mathcal{R}_1$ is a connected set by Statement (i). (Recall Y = A in our formulation.) ■

The proof of Theorem 9 is in Appendix H. The main implication of Theorem 9 is that the
constrained optimal social learning scheme formulated in Chamley [11] has a monotone structure.
This is in contrast to the multi-threshold behavior of the stopping time problem in Sec. VI-B when
agents were selfish in picking their local actions. In [11, Chapter 4.5], the above formulation is
used for pricing information externalities in social learning. From an implementation point of
view, the existence of a threshold switching curve implies that the protocol only needs individual
agents to store the optimal linear MLR policy (computed, for example, using Algorithm 1).
Finally, $\mathcal{R}_1$ is the union of |Y| convex sets and is non-convex in general. This is different from
standard stopping problems where the stopping set is convex.

VII. Example 6: Multi-agent Scheduling in a Changing World 

So far we have considered models where the underlying state is a constant (Sec. VI), or a
Markov chain that jumps once into an absorbing state (change detection of Sec. III) or jumps
twice (transient detection of Sec. IV). In this section, we consider a more general model where
the target state x evolves on the same time scale as the observation process. The target state x 
jumps with time according to a finite state Markov chain over the state space X with transition 
probability matrix P. Also, unlike previous sections, decision u = 1 does not 'stop' the evolution 
of the belief state. So instead of a stopping time problem, we have a more general partially- 
observed stochastic control problem. 

As mentioned in Sec. I, this section is motivated by two questions: (i) How can the optimal 
policy be bounded? (ii) How does the optimal achievable cost vary with transition probability? 
The main results of this section are two-fold. First, using Blackwell dominance, we show that 
the optimal policy is lower bounded by a myopic policy (Theorem 10). Next, Theorem 11 shows 
that for the underlying Markovian state, the larger the transition matrix (in an order defined in 
(75)), the cheaper the expected optimal cost. This is useful in comparing the optimal achievable 
cost of quickest time detection with different PH-distributions. 

A. Myopic Policy Bound to Optimal Decision Policy 

Consider a countable number of agents where each agent acts once in a predetermined
sequential order indexed by k = 1, 2, ... as follows: Based on the current belief state $\pi_{k-1}$,
agent k chooses mode

\[ u_k \in \{1 \text{ (low resolution)},\ 2 \text{ (high resolution)}\}. \]

Depending on its mode $u_k$, agent k views the world according to this mode, that is, it obtains
an observation from a distribution that depends on $u_k$. Assume that for mode $u \in \{1, 2\}$, the
observation $y^{(u)} \in Y^{(u)} = \{1, \ldots, Y^{(u)}\}$ is obtained from the matrix of conditional probabilities

\[ B^{(u)} = \big( B^{(u)}_{i y^{(u)}},\ i \in X,\ y^{(u)} \in Y^{(u)} \big) \quad \text{where } B^{(u)}_{i y^{(u)}} = P(y^{(u)} | x = e_i, u). \tag{70} \]

The notation $Y^{(u)}$ allows for mode dependent observation spaces. In sensor scheduling [21], the
tradeoff is as follows: Mode u = 2 yields more accurate observations of the state than mode
u = 1, but the cost of choosing mode u = 2 is higher than mode u = 1. Thus there is a tradeoff
between the cost of acquiring information and the value of the information. The assumption that
mode u = 2 yields more accurate observations than mode u = 1 is modelled by

\[ B^{(1)} = B^{(2)} Q. \tag{71} \]

Here Q is a $Y^{(2)} \times Y^{(1)}$ stochastic matrix. Q can be viewed as a confusion matrix that maps
$Y^{(2)}$ probabilistically to $Y^{(1)}$. (In a communications context, one can view Q as a noisy discrete
memoryless channel with input $y^{(2)}$ and output $y^{(1)}$.)
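Relation (71) says that the low resolution observation probabilities are obtained by passing the high resolution observation through the channel Q. A minimal sketch of this construction is given below; the matrices are hypothetical numbers chosen only to illustrate the product, and the function names are not from the paper.

```python
def degrade(B2, Q):
    """Return B1 = B2 Q: observation probabilities after passing y2 through channel Q."""
    return [[sum(B2[i][m] * Q[m][j] for m in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(B2))]


def is_row_stochastic(M, tol=1e-12):
    """Each row is a probability vector (non-negative entries summing to one)."""
    return all(min(row) >= -tol and abs(sum(row) - 1.0) <= tol for row in M)


B2_example = [[0.9, 0.1], [0.1, 0.9]]   # accurate mode u = 2 (hypothetical)
Q_example = [[0.8, 0.2], [0.2, 0.8]]    # confusion matrix (hypothetical)
B1_example = degrade(B2_example, Q_example)
```

Since the product of two row-stochastic matrices is row-stochastic, `B1_example` is again a valid observation matrix, with rows that are closer to each other (less informative) than those of `B2_example`.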

When agent k chooses mode $u_k \in \{1, 2\}$, it incurs the expected cost

\[ C(\pi_{k-1}, u_k) = \alpha_{u_k} E\big\{ \|(x_k - E\{x_k | \mathcal{F}_k\})' g\|^2 \,\big|\, \mathcal{F}_{k-1} \big\} + E\{c(x_k, u_k) | \mathcal{F}_{k-1}\} \tag{72} \]

where $c_u = (c(x = e_i, u),\ i \in X)$, $\mathcal{F}_k$ is defined in (10), g and G are defined in (14). In (72),
the tradeoff between information obtained from a mode and the cost of operating in the mode
is modelled as follows: Choose $\alpha_1 > \alpha_2$ to penalize choosing the less accurate mode 1 in terms
of the variance, while $c(e_i, 1) < c(e_i, 2)$ since mode 1 incurs a cheaper operating cost.

The goal is to compute the optimal policy $\mu^*(\pi) \in \{1, 2\}$ to minimize the overall cost incurred
by all the agents

\[ J_\mu(\pi_0) = E^\mu_{\pi_0}\Big\{ \sum_{k=1}^{\infty} \rho^{k-1} C(\pi_{k-1}, u_k) \Big\}. \tag{73} \]

The above problem is not a stopping time problem, since if mode u = 1 is chosen, the problem
does not terminate. The mode u chosen by each agent will affect the modes chosen by subsequent
agents, and hence affects the total cost. For such a partially observed stochastic control problem,
determining the optimal policy $\mu^*(\pi)$ is computationally intractable. However, using Blackwell
dominance, we show below that a myopic policy forms a lower bound for the optimal policy.

The value function $V(\pi)$ and optimal policy $\mu^*(\pi)$ satisfy the dynamic programming equation

\begin{align}
& V(\pi) = \min_{u \in \{1, 2\}} Q(\pi, u), \qquad \mu^*(\pi) = \arg\min_{u \in \{1, 2\}} Q(\pi, u), \qquad J_{\mu^*}(\pi) = V(\pi), \tag{74} \\
& Q(\pi, u) = C(\pi, u) + \rho \sum_{y^{(u)} \in Y^{(u)}} V\big(T(\pi, y^{(u)}, u)\big)\, \sigma(\pi, y^{(u)}, u), \nonumber \\
& T(\pi, y^{(u)}, u) = \frac{B^{(u)}_{y^{(u)}} P' \pi}{\sigma(\pi, y^{(u)}, u)}, \qquad \sigma(\pi, y^{(u)}, u) = \mathbf{1}' B^{(u)}_{y^{(u)}} P' \pi. \nonumber
\end{align}

We now present the structural result. Let $\Pi^s \subset \Pi(X)$ denote the set of belief states for which
$C(\pi, 2) < C(\pi, 1)$. Define the myopic policy

\[ \underline{\mu}(\pi) = \begin{cases} 2 & \pi \in \Pi^s \\ 1 & \text{otherwise.} \end{cases} \]

Theorem 10: The myopic policy $\underline{\mu}(\pi)$ satisfies the following property: For all $\pi \in \Pi^s$, $\mu^*(\pi) = \underline{\mu}(\pi)$,
i.e., it is optimal to pick action 2. Therefore, $\underline{\mu}(\pi)$ forms a lower bound for the optimal
policy $\mu^*(\pi)$, i.e., $\mu^*(\pi) \ge \underline{\mu}(\pi)$ for all $\pi \in \Pi(X)$. ■

Theorem 10 is proved in Appendix I. The usefulness of Theorem 10 stems from the fact that
$\underline{\mu}(\pi)$ is trivial to compute. It forms a rigorous lower bound to the computationally intractable
optimal policy $\mu^*(\pi)$. What the theorem says is that $\underline{\mu}(\pi)$ lower bounds the optimal policy, and
coincides with the optimal policy in region $\Pi^s$. Since $\underline{\mu}$ is sub-optimal, it incurs a higher cost.
This cost can be evaluated via simulation and forms an upper bound to the optimal achievable
cost.

Theorem 10 is non-trivial. Just because at some time k, the expected instantaneous costs
satisfy $C(\pi_k, 2) < C(\pi_k, 1)$, does not necessarily imply that the myopic policy $\underline{\mu}(\pi)$ coincides
with the optimal policy $\mu^*(\pi)$, since the optimal policy applies to the infinite trajectory of the
dynamical system.

The proof uses Blackwell dominance of measures. The first instance of a similar proof using
Blackwell dominance for POMDPs was given in [49], see also [41]. Our proof is similar and
uses concavity of the value function $V(\pi)$ and Blackwell dominance of observation probabilities.
In particular, observation $y^{(2)}$ is more informative than (Blackwell dominates) observation $y^{(1)}$
if (71) holds, see [41]. The proof of Theorem 10 in the appendix consists of first proving
concavity of $V(\pi)$. The proof is a non-trivial extension, since $C(\pi, 1)$ is nonlinear in $\pi$.

B. Effect of State Transition Matrix

The next structural result establishes how the optimal expected cost varies with different 
transition matrices. The model we consider applies to both stopping time problems (such as 
quickest time detection) and multi-agent scheduling considered above. 

Suppose $P^{(1)}$, $P^{(2)}$ are two distinct transition probability matrices corresponding to two distinct
models of Markov state evolution. Let $V(\pi;P^{(1)})$ and $V(\pi;P^{(2)})$ denote the corresponding
optimal value functions in (74). The question we pose is: How does $V(\pi;P)$ vary with transition
matrix $P$? For example, in the quickest detection problem, do certain phase-type distributions
result in larger total optimal cost compared to other phase-type distributions? A similar question
can be posed for the stochastic control problem considered above.

We consider costs that are linear in the belief state. To show the explicit dependence on the
transition matrix, define the costs (recall $c_u = (c(e_i,u),\ i \in X)$)

$$C(\pi,u;P) = c_u' P' \pi, \quad u \in \{1,2\}.$$

The result below also applies to the case where $C(\pi,u) = c_u'\pi$ (i.e., the cost at each stage is
not an explicit function of the transition matrix).

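Define the following ordering of transition matrices $P^{(1)}$ and $P^{(2)}$:

$$P^{(1)} \succeq P^{(2)} \quad \text{if} \quad P^{(1)}_{ij} P^{(2)}_{ml} \le P^{(2)}_{ij} P^{(1)}_{ml}, \quad l > j, \quad i, j, m \in X. \qquad (75)$$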

Theorem 11: Assume $P^{(1)} \succeq P^{(2)}$, $c(e_i,u)$ is decreasing in $i$, $C(\pi,u;P^{(1)}) \le C(\pi,u;P^{(2)})$
and (A2), (A3) hold. Then:

(i) $V(\pi;P^{(1)}) \le V(\pi;P^{(2)})$. That is, the larger the transition matrix (with respect to the partial
ordering (75)), the lower the optimal expected cost incurred when making optimal decisions.

(ii) Consider the quickest time detection problem with $\alpha = 0$ and costs in (29). The optimal
expected cost $J_{\mu^*}(\pi;P)$ (see (23)) with change time distribution $P^{(1)}$ is less than that with $P^{(2)}$.
■

Theorem 11 is proved in Appendix J. We now present three examples.

Example (i): Consider X = {1, 2} and the dynamic decision making formulation of Sec. VII-A.
Then using (75) it can be verified that the transition matrices

$$P^{(1)} = \begin{bmatrix} 0.2 & 0.8 \\ 0.1 & 0.9 \end{bmatrix} \succeq P^{(2)} = \begin{bmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{bmatrix}.$$

Note that $P^{(1)}$ and $P^{(2)}$ above are TP2 (as required by (A3)). So Theorem 11 applies.

Fig. 5. Monotone behaviour of optimal expected cost (value function $V(\pi;P^{(p)})$) versus probability $p$ for quickest detection
with geometric change time, see (76) for notation. The values of $p$ are 0.99 (lowest plot), 0.95, 0.9, 0.8, and 0.01 (highest plot).

Example (ii): The transition matrix

$$\begin{bmatrix} p & 1-p \\ p & 1-p \end{bmatrix}$$

corresponding to a two state iid process is decreasing with respect to the order (75) as $p$ increases
from 0 to 1. So the theorem says that the smaller $p$ is, the cheaper the optimal expected cost. So even
though the underlying process has maximum uncertainty (entropy) when $p = 0.5$, Theorem 11 says
that the largest total cost incurred is when $p = 1$.

Example (iii): Consider the quickest detection costs (29) with $\alpha = 0$. Consider first the geometric
distributed change time case ($X = 2$) with transition matrix $P^{(p)}$ and parameters in (29)

$$P^{(p)} = \begin{bmatrix} 1 & 0 \\ 1-p & p \end{bmatrix}, \quad B = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix}, \quad \rho = 0.9, \quad d = 0.9. \qquad (76)$$

It can be verified that $P^{(p)}$ is increasing (in terms of (75)) in $p$. So Theorem 11 says that the
larger $p$ is, the smaller the average optimal cost (value function $V(\pi;P^{(p)})$) for quickest time
detection. In Fig. 5 we plot the value function $V(\pi;P^{(p)})$ for several values of $p$, to illustrate this
behavior.

Next, consider quickest detection with PH-distributed change times (7) modelled by the
following transition matrices:

$$P^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 0.3 & 0.2 \\ 0.3 & 0.4 & 0.3 \end{bmatrix} \succeq P^{(2)} = \begin{bmatrix} 1 & 0 & 0 \\ 0.9 & 0.1 & 0 \\ 0.8 & 0.15 & 0.05 \end{bmatrix}.$$

Since $P^{(1)}$ and $P^{(2)}$ are TP2, so that (A3) holds, Theorem 11 implies that the optimal expected cost incurred
in quickest change detection with PH-distributed change time $P^{(1)}$ is less than that of $P^{(2)}$.
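
The ordering used in these examples can be checked numerically. The sketch below assumes that (75) has the elementwise form $P^{(1)}_{ij} P^{(2)}_{ml} \le P^{(2)}_{ij} P^{(1)}_{ml}$ for all $i, m$ and $l > j$; this index pattern is our reading of the condition and should be treated as an assumption. The function name `dominates` is hypothetical.

```python
import numpy as np

def dominates(P1, P2, tol=1e-12):
    """Check P1[i,j]*P2[m,l] <= P2[i,j]*P1[m,l] for all i, m and all l > j."""
    P1, P2 = np.asarray(P1, dtype=float), np.asarray(P2, dtype=float)
    X = P1.shape[1]
    for j in range(X):
        for l in range(j + 1, X):
            lhs = np.outer(P1[:, j], P2[:, l])  # entry (i, m) = P1[i,j] * P2[m,l]
            rhs = np.outer(P2[:, j], P1[:, l])  # entry (i, m) = P2[i,j] * P1[m,l]
            if np.any(lhs > rhs + tol):
                return False
    return True

def geometric(p):
    """Two-state geometric change-time chain; state 1 is absorbing."""
    return np.array([[1.0, 0.0], [1.0 - p, p]])

# The 3-state PH matrices of the example above.
PH1 = [[1, 0, 0], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
PH2 = [[1, 0, 0], [0.9, 0.1, 0], [0.8, 0.15, 0.05]]
```

Under this reading, `dominates(PH1, PH2)` and `dominates(geometric(0.9), geometric(0.8))` both hold, matching the claims of Examples (i) to (iii).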

Discussion. (i) TP2 dominance versus dominance in (75). It is shown in [28], [41] that if
transition matrices are ordered in the TP2 sense, namely $P^{(1)} \ge_{TP2} P^{(2)}$ (see Defn. 4(i) in Appendix),
then Theorem 11 holds under the same assumptions as above. It is easy to prove that for the
$2 \times 2$ case, only matrices with identical rows, i.e., transition matrices modelling independent
and identically distributed (iid) finite state processes, satisfy $P^{(1)} \ge_{TP2} P^{(2)}$. Our conjecture is
that the only examples of transition matrices that satisfy $P^{(1)} \ge_{TP2} P^{(2)}$ are transition matrices
corresponding to iid processes. So TP2 dominance is less useful than the ordering (75).

(ii) Kolmogorov-Shiryayev criterion: If the Kolmogorov-Shiryayev criterion (22) is considered
then a similar proof to Theorem 11 shows that $V(\pi;P)$ is increasing with $P$. The reason is that
in this case $C(\pi,2;P)$ in (24) (with $d e_1' P'\pi$ replaced by $d e_1'\pi$, see Theorem 2) has the term
$-\rho(\alpha+\beta) e_1' P'\pi$. This is increasing in $P$ (wrt ordering (75)).



VIII. Numerical Examples 

For state-space X = {1, 2, 3}, i.e., X = 3 states, the belief state space $\Pi(X)$ is an equilateral
triangle, and the various results of this paper can be illustrated visually. There is much flexibility
in the choice of parameters that satisfy the general assumptions (A1), (A2), (A3), (S) in the
appendix.

Example 1: We illustrate the structural result Theorem 1 for quickest time change detection
with PH-distributed change time and variance penalty. The following parameters were chosen
in (29):

$$\beta = 1, \quad \rho = 1, \quad P = \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 0.1 & 0.6 \\ 0 & 0.02 & 0.98 \end{bmatrix}.$$

Assume Gaussian observation noise with variance 0.01; so $Y = \mathbb{R}$ and the observation likelihoods
are $B_{1y} \sim N(0, 0.01)$, $B_{2y} \sim N(1, 0.01)$. The operational cost $c = 10^{-3}$ (see discussion below
(17)).

The optimal policy was computed by forming a grid of 230 values in the 2-dimensional unit
simplex, and then running the value iteration algorithm (27) over this grid with a horizon length
of 200. The optimal policy is shown in Fig. 6 for four different choices of $\alpha$ and $d$. The first three
examples, namely $\alpha$, $d$ specified in Fig. 6(a), (b) and (c), satisfy assumptions (A1-Ex1), (A2),
(A3) and (S-Ex1). Therefore the optimal policy satisfies Theorem 1 and is characterized by
a threshold curve $\Gamma$. The figures clearly show the existence of a threshold curve that partitions
$\Pi(X)$ into two individually connected regions. Recall that for the case $\alpha = 0$, Theorem 1 says
that the optimal stopping set $\mathcal{R}_1$ is a convex set.
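
A uniform grid on the 2-dimensional unit simplex, of the kind used for such a value iteration, could be generated as in the following Python sketch. The resolution `n` and the function name `simplex_grid` are hypothetical choices; this is not claimed to reproduce the exact 230-point grid used for Fig. 6.

```python
import numpy as np

def simplex_grid(n):
    """Beliefs (i, j, n - i - j) / n over all nonnegative integers i + j <= n."""
    pts = [(i, j, n - i - j) for i in range(n + 1) for j in range(n + 1 - i)]
    return np.array(pts, dtype=float) / n

# Each row is a belief state over X = 3 states; there are (n+1)(n+2)/2 rows.
grid = simplex_grid(20)
```

Value iteration would then evaluate the cost-to-go at each row of `grid`, interpolating the value function at updated beliefs that fall between grid points.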

Is the stopping set $\mathcal{R}_1$ always a connected set even when the assumptions of Theorem 1
do not hold? Recall the assumptions in Theorem 1 are sufficient conditions. The parameters
$\alpha = 10$, $d = 5$ in Fig. 6(d) do not satisfy condition (S-Ex1) in Appendix B. As shown in
Fig. 6(d), the optimal stopping set $\mathcal{R}_1$ is no longer connected. This highlights the importance
of developing useful sufficient conditions that result in monotone policies $\mu^*(\pi)$, such as the
assumptions presented in this paper.

Example 2: Here we consider the classical delay cost (18). We illustrate how the optimal
stopping region $\mathcal{R}_1$ varies with transition probabilities of the PH-distribution for change time.
Since all these constraints in $f$ are linear, determining feasible choices is straightforward using
the Matlab command linprog.

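We chose $B_{1y} \sim N(0,4)$, $B_{2y} \sim N(1,4)$ and the following parameters in (31):

$$\alpha = 0.5, \quad \beta = 1, \quad d = 1, \quad \rho = 0.75, \quad f = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}, \quad P = \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 0.6 & 0.1 \\ 0.1 & p & 0.9 - p \end{bmatrix}. \qquad (77)$$

Then (A3) holds, i.e., $P$ is TP2 for $p \in [0.2, 0.7714]$. Also it can be verified that all the other
assumptions of Theorem 2 hold.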

Fig. 6. Quickest Time Detection illustrating Theorem 1. Panel titles: (c) $\alpha = 10$, $d = 11$; (d) $\alpha = 10$, $d = 5$.
The o region represents the stopping set $\mathcal{R}_1$ where decision $u = 1$ (stop)
is optimal. The empty region in the simplex represents $\mathcal{R}_2$ where $u = 2$ (continue) is optimal. Cases (a), (b) and (c) satisfy the
assumptions of Theorem 1. The optimal linear threshold for these 3 cases was estimated using Algorithm 1. Case (d) does not
satisfy Assumption (S-Ex1) and $\mathcal{R}_1$ is not a connected set.

Fig. 7(a), (b) illustrate the optimal stopping region $\mathcal{R}_1$ for $p = 0.2$ and $p = 0.77$, respectively.
Fig. 7(c) plots the PH-distribution probability mass function $\nu_k$ in (7) versus time $k$ for the transition
probabilities in Example 1 and Example 2 for $p = 0.2$ and $p = 0.77$. It can be seen that even
with a 3-state Markov chain the behavior is quite different to a geometric distribution.
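
The PH-distribution probability mass function of the change time, as plotted in Fig. 7(c), can be computed from a transition matrix alone. The following Python sketch assumes that state 1 is absorbing, that beliefs propagate as $\pi_{k+1} = P'\pi_k$, and that the chain starts in the last state; the initial distribution, the indexing convention for $\nu_k$, the function name `ph_pmf`, and the matrix `P_EX1` (our reading of the Example 1 matrix) are all assumptions.

```python
import numpy as np

def ph_pmf(P, pi0, K):
    """nu[k] = Prob(chain first enters absorbing state 1 at time k), k = 0..K."""
    P = np.asarray(P, dtype=float)
    pi = np.asarray(pi0, dtype=float)
    nu = np.zeros(K + 1)
    nu[0] = pi[0]                # mass already absorbed at time 0
    absorbed = pi[0]
    for k in range(1, K + 1):
        pi = P.T @ pi            # one-step state distribution update
        nu[k] = pi[0] - absorbed  # newly absorbed mass at step k
        absorbed = pi[0]
    return nu

# Assumed reading of the 3-state matrix of Example 1, started in state 3.
P_EX1 = np.array([[1.0, 0.0, 0.0], [0.3, 0.1, 0.6], [0.0, 0.02, 0.98]])
nu_ex1 = ph_pmf(P_EX1, [0.0, 0.0, 1.0], 50)
```

For a two-state chain with rows $(1, 0)$ and $(1-p, p)$ started in state 2, the same routine returns the geometric pmf $(1-p)p^{k-1}$, which gives a baseline for the comparison made above.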



IX. Conclusions 

This paper has presented structural results for Bayesian quickest time detection with PH-distributed
change time and variance penalty. The main result is Theorem 1 which proves the
existence of a threshold switching curve for optimal decision making under general assumptions
(A1), (A2), (A3), (S) given in the appendix. Theorem 4 gave necessary and sufficient conditions
for a linear threshold policy to approximate the threshold curve. Then several examples were
considered, namely quickest transient detection, quickest time detection with exponential penalty,
stopping time problems in social learning, constrained optimal social learning, and multi-agent
scheduling in a changing world. In the case of exponential penalty we used a risk sensitive
stochastic control formulation. In all these examples, under similar assumptions to Theorem 1,
the threshold switching curve holds. The proofs of the results use lattice programming and
stochastic orders on the unit simplex. The structural results of this paper are class type results,
that is, for parameters belonging to a set, the results hold. Hence there is an inherent robustness
in these results since even if the underlying parameters are not exactly specified but still belong
to the appropriate sets, the results still hold. It would be useful to do a performance analysis of
the various optimal detectors proposed in this paper; see [40] and references therein.

Fig. 7. Effect of PH-distribution probabilities (77) on optimal stopping region $\mathcal{R}_1$. The region marked with o denotes $\mathcal{R}_1$.
Panel (c): change time distribution versus absorption time $k$.

Appendix

A. Stochastic Orders and Submodularity 

Theorem 1 requires proving that the quickest time detection policy $\mu^*(\pi)$ is monotonically
increasing in belief state $\pi$. That is, $\pi \le \tilde{\pi}$ (in a sense to be made clear below) implies
$\mu^*(\pi) \le \mu^*(\tilde{\pi})$. In order to compare belief states $\pi$ and $\tilde{\pi}$, we will use the monotone likelihood
ratio (MLR) stochastic ordering and a specialized version of the MLR order restricted to lines
in the simplex $\Pi(X)$. This stochastic order is useful since it is preserved under conditional
expectations [41], [18], [50], [32]. Below we introduce several important definitions that will be
used subsequently.

Definition 1 (MLR ordering, [32]): Let $\pi_1, \pi_2 \in \Pi(X)$ be any two belief state vectors. Then
$\pi_1$ is greater than $\pi_2$ with respect to the MLR ordering, denoted as $\pi_1 \ge_r \pi_2$, if

$$\pi_1(i)\pi_2(j) \le \pi_2(i)\pi_1(j), \quad i < j, \quad i, j \in \{1,\ldots,X\}. \qquad (78)$$

Similarly $\pi_1 \le_r \pi_2$ if $\le$ in (78) is replaced by $\ge$.

Definition 2 (First order stochastic dominance, [32]): Let $\pi_1, \pi_2 \in \Pi(X)$. Then $\pi_1$ first order
stochastically dominates $\pi_2$, denoted as $\pi_1 \ge_s \pi_2$, if $\sum_{i=j}^{X} \pi_1(i) \ge \sum_{i=j}^{X} \pi_2(i)$ for $j = 1,\ldots,X$.

Result 1 ([32]): (i) Let $\pi_1, \pi_2 \in \Pi(X)$. Then $\pi_1 \ge_r \pi_2$ implies $\pi_1 \ge_s \pi_2$.
(ii) Let $\mathcal{V}$ denote the set of all $X$ dimensional vectors $v$ with nondecreasing components, i.e.,
$v_1 \le v_2 \le \cdots \le v_X$. Then $\pi_1 \ge_s \pi_2$ iff for all $v \in \mathcal{V}$, $v'\pi_1 \ge v'\pi_2$.

For state-space dimension $X = 2$, MLR is a complete order and coincides with first order
stochastic dominance. For state-space dimension $X > 2$, MLR is a partial order, i.e., $[\Pi(X), \ge_r]$
is a partially ordered set (poset) since it is not always possible to order any two belief states
$\pi \in \Pi(X)$. However, on line segments in the simplex defined below, MLR is a total ordering.
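
Both orderings are easy to test numerically. The Python sketch below implements the inequality (78) and the tail-sum condition of Definition 2; the function names and the example beliefs are hypothetical and chosen only to show one comparable pair and one MLR-incomparable pair for $X = 3$.

```python
import numpy as np

def mlr_geq(p1, p2, tol=1e-12):
    """p1 >=_r p2: p1[i]*p2[j] <= p2[i]*p1[j] for all i < j."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    M = np.outer(p1, p2) - np.outer(p2, p1)  # M[i,j] = p1[i]p2[j] - p2[i]p1[j]
    return bool(np.all(np.triu(M, k=1) <= tol))

def fosd_geq(p1, p2, tol=1e-12):
    """p1 >=_s p2: every tail sum of p1 is at least the tail sum of p2."""
    t1 = np.cumsum(np.asarray(p1, dtype=float)[::-1])
    t2 = np.cumsum(np.asarray(p2, dtype=float)[::-1])
    return bool(np.all(t1 >= t2 - tol))

a, b = [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]   # a is MLR larger than b
c, d = [0.2, 0.6, 0.2], [0.4, 0.2, 0.4]   # neither c >=_r d nor d >=_r c
```

Here `a` dominates `b` in both senses, consistent with Result 1(i), while `c` and `d` cannot be MLR ordered, which illustrates why MLR is only a partial order for $X > 2$.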

For $i \in X$, define the sub simplex $\mathcal{H}_i \subset \Pi(X)$ as

$$\mathcal{H}_i = \{\bar{\pi} \in \Pi(X) : \bar{\pi}(i) = 0\}. \qquad (79)$$

Denote belief states that lie in $\mathcal{H}_i$ by $\bar{\pi}$. For each $\bar{\pi} \in \mathcal{H}_i$, construct the line segment $\mathcal{L}(e_i,\bar{\pi})$
that connects $\bar{\pi}$ to $e_i$. Thus $\mathcal{L}(e_i,\bar{\pi})$ comprises belief states $\pi$ of the form:

$$\mathcal{L}(e_i,\bar{\pi}) = \{\pi \in \Pi(X) : \pi = (1-\epsilon)\bar{\pi} + \epsilon e_i,\ 0 \le \epsilon \le 1\}, \quad \bar{\pi} \in \mathcal{H}_i. \qquad (80)$$

Definition 3 (MLR ordering $\ge_{L_i}$ on lines): $\pi_1$ is greater than $\pi_2$ with respect to the MLR
ordering on the line $\mathcal{L}(e_i,\bar{\pi})$, denoted as $\pi_1 \ge_{L_i} \pi_2$, if $\pi_1, \pi_2 \in \mathcal{L}(e_i,\bar{\pi})$ for some $\bar{\pi} \in \mathcal{H}_i$, i.e.,
$\pi_1, \pi_2$ are on the same line connected to vertex $e_i$ of simplex $\Pi(X)$, and $\pi_1 \ge_r \pi_2$.

Note that $[\Pi(X), \ge_{L_X}]$ and $[\Pi(X), \ge_{L_1}]$ are chains$^8$, i.e., all elements $\pi, \tilde{\pi} \in \mathcal{L}(e_X,\bar{\pi})$ are
comparable, i.e., either $\pi \ge_{L_X} \tilde{\pi}$ or $\tilde{\pi} \ge_{L_X} \pi$ (and similarly for $\mathcal{L}(e_1,\bar{\pi})$). In Lemma 2, we
summarize useful properties of $[\Pi(X), \ge_{L_i}]$ that will be used in our proofs.

Lemma 2: The following properties hold on $[\Pi(X), \ge_r]$, $[\mathcal{L}(e_X,\bar{\pi}), \ge_{L_X}]$.

(i) On $[\Pi(X), \ge_r]$, $e_1$ is the least and $e_X$ is the greatest element. On $[\mathcal{L}(e_X,\bar{\pi}), \ge_{L_X}]$, $\bar{\pi}$ is the
least and $e_X$ is the greatest element.

(ii) Convex combinations of MLR comparable belief states form a chain. For any $\gamma \in [0,1]$,
$\pi \le_r \tilde{\pi} \implies \pi \le_r \gamma\pi + (1-\gamma)\tilde{\pi} \le_r \tilde{\pi}$.

(iii) All points on a line $\mathcal{L}(e_X,\bar{\pi})$ are MLR comparable.
Consider any two points $\pi^{\gamma_1}, \pi^{\gamma_2} \in \mathcal{L}(e_X,\bar{\pi})$ (80). Then $\gamma_1 \ge \gamma_2$ implies $\pi^{\gamma_1} \ge_{L_X} \pi^{\gamma_2}$.

Let $\mathbf{i} = (i_1,\ldots,i_L)$ and $\mathbf{j} = (j_1,\ldots,j_L)$ denote the indices of two $L$-variate probability mass
functions. Denote the lattice operators

$$\mathbf{i} \wedge \mathbf{j} = [\min(i_1,j_1),\ldots,\min(i_L,j_L)]', \quad \mathbf{i} \vee \mathbf{j} = [\max(i_1,j_1),\ldots,\max(i_L,j_L)]'. \qquad (81)$$

Definition 4 (TP2 ordering and Reflexive TP2 distributions): Let $P$ and $Q$ denote any two
$L$-variate probability mass functions. Then:

(i) $P \ge_{TP2} Q$ if $P(\mathbf{i})Q(\mathbf{j}) \le P(\mathbf{i} \vee \mathbf{j})Q(\mathbf{i} \wedge \mathbf{j})$. If $P$ and $Q$ are univariate, then this definition is
equivalent to the MLR ordering $P \ge_r Q$ defined above.

(ii) A multivariate distribution $P$ is said to be multivariate TP2 (MTP2) if $P \ge_{TP2} P$ holds, i.e.,
$P(\mathbf{i})P(\mathbf{j}) \le P(\mathbf{i} \vee \mathbf{j})P(\mathbf{i} \wedge \mathbf{j})$.

(iii) If $i, j \in \{1,\ldots,X\}$ are scalar indices, Statement (ii) is equivalent to saying that an $X \times X$
matrix $P$ is TP2 if all second order minors are non-negative. Equivalently, $P_{i+1,\cdot} \ge_r P_{i,\cdot}$, where
$P_{i,\cdot}$ denotes the $i$th row of matrix $P$.
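
Statement (iii) gives a direct numerical test for assumptions (A2) and (A3). The Python sketch below checks all second order minors; the function name `is_tp2` is hypothetical, and the parametrized matrix `example2_P` is our reading of the Example 2 transition matrix, so it should be treated as an assumption.

```python
import numpy as np
from itertools import combinations

def is_tp2(P, tol=1e-12):
    """True if every second order minor of P is non-negative."""
    P = np.asarray(P, dtype=float)
    n_rows, n_cols = P.shape
    for i, k in combinations(range(n_rows), 2):      # row pairs i < k
        for j, l in combinations(range(n_cols), 2):  # column pairs j < l
            if P[i, j] * P[k, l] - P[i, l] * P[k, j] < -tol:
                return False
    return True

def example2_P(p):
    """Assumed form of the Example 2 matrix as a function of p."""
    return np.array([[1.0, 0.0, 0.0], [0.3, 0.6, 0.1], [0.1, p, 0.9 - p]])
```

With this reading, `is_tp2(example2_P(p))` holds for `p = 0.5` and fails for `p = 0.1` and `p = 0.8`, consistent with the TP2 range reported in the numerical examples.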

To prove the existence of a threshold switching curve, we will show that $Q(\pi,u)$ in (25) is a
submodular function on chains $[\Pi(X), \ge_{L_X}]$ and $[\Pi(X), \ge_{L_1}]$.

Definition 5 (Submodular function [48]): Suppose $i = 1$ or $X$. Then $f : \mathcal{L}(e_i,\bar{\pi}) \times \{1,2\} \to \mathbb{R}$
is submodular (antitone differences) if $f(\pi,u) - f(\pi,\bar{u}) \le f(\tilde{\pi},u) - f(\tilde{\pi},\bar{u})$, for $\bar{u} \le u$,
$\pi \ge_{L_i} \tilde{\pi}$.

$^8$ A chain is a totally ordered subset of a partially ordered set.

The following result says that for a submodular function $Q(\pi,u)$, $\mu^*(\pi) = \arg\min_u Q(\pi,u)$
is increasing in its argument $\pi$. This implies $\mu^*(\pi)$ is MLR increasing on the line segments
$\mathcal{L}(e_1,\bar{\pi})$ and $\mathcal{L}(e_X,\bar{\pi})$, which in turn will be used to prove the existence of a threshold decision
curve.

Theorem 12 ([48]): Suppose $i = 1$ or $X$. If $f : \mathcal{L}(e_i,\bar{\pi}) \times \{1,2\} \to \mathbb{R}$ is submodular,
then there exists a $\mu^*(\pi) = \arg\min_{u \in \{1,2\}} f(\pi,u)$ that is increasing on $[\mathcal{L}(e_i,\bar{\pi}), \ge_{L_i}]$, i.e.,

$$\tilde{\pi} \ge_{L_i} \pi \implies \mu^*(\pi) \le \mu^*(\tilde{\pi}).$$
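
The mechanism behind Theorem 12 can be seen on a discretized chain. In the Python sketch below, the rows of the array `f` follow a chain in increasing order and the two columns are actions 1 and 2; the particular cost values are hypothetical and chosen only so that the difference between the two actions decreases along the chain.

```python
import numpy as np

def is_submodular(f, tol=1e-12):
    """f[k, u-1] on a chain: f(., 2) - f(., 1) must be nonincreasing in k."""
    diff = f[:, 1] - f[:, 0]
    return bool(np.all(np.diff(diff) <= tol))

def argmin_policy(f):
    """Smallest minimizing action (1 or 2) at each point of the chain."""
    return np.argmin(f, axis=1) + 1

eps = np.linspace(0.0, 1.0, 11)                        # position along the chain
f = np.column_stack([np.full_like(eps, 0.5), 1.0 - eps])  # hypothetical f(., 1), f(., 2)
policy = argmin_policy(f)
```

Because `f` is submodular, `policy` switches from action 1 to action 2 exactly once as `eps` increases, which is the single-threshold behaviour exploited on each line segment.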

B. Meta-Theorem: Lattice Programming on Simplex

We start with the following general assumptions and meta-theorem. The meta-theorem is a
major step in all our structural results.

(A1) $C(\pi,1)$ and $C(\pi,2)$ are first order stochastically decreasing in $\pi$.
(A2) $B_{xy}$ is TP2.
(A3) $P$ is TP2.

(S) $C(\pi,2) - C(\pi,1)$ is $[\mathcal{L}(e_X,\bar{\pi}), \ge_{L_X}]$ decreasing and $[\mathcal{L}(e_1,\bar{\pi}), \ge_{L_1}]$ decreasing.

Theorem 13: Consider the generic dynamic programming equation (25) and the above
assumptions. Then the following properties hold.

1) $\pi_1 \ge_r \pi_2$ implies $T(\pi_1,y) \ge_r T(\pi_2,y)$ if (A3) holds. Under (A2) and (A3),
$\sigma(\pi_1,\cdot) \ge_s \sigma(\pi_2,\cdot)$.

2) For $y, \bar{y} \in Y$, $y \ge \bar{y}$ implies $T(\pi,y) \ge_r T(\pi,\bar{y})$ iff (A2) holds.

3) Assumptions (A1-Ex1) in Sec. III-A, (AS-Ex1)(i) and (ii) in Sec. III-A, (A1-Ex2) in Sec. IV,
(A1-Ex3) in Sec. V, and (A1-Ex5) in Sec. VI-C are sufficient conditions for (A1).

4) Under (A1), (A2), (A3), $Q(\pi,u)$ is MLR decreasing wrt $\ge_{L_1}$ and $\ge_{L_X}$.

5) Assumptions (S-Ex1) in Sec. III-A, (AS-Ex1)(i) and (iii) in Sec. III-A, (S-Ex2) in Sec. IV,
(S-Ex3) in Sec. V and (S-Ex5) in Sec. VI-C are sufficient conditions for (S).

6) Under (A1), (A2), (A3), (S), $Q(\pi,u)$ is submodular on $[\Pi(X), \ge_{L_X}]$ and $[\Pi(X), \ge_{L_1}]$.
Thus by Theorem 12, the optimal policy $\mu^*(\pi)$ is MLR increasing on lines $\mathcal{L}(e_X,\bar{\pi})$ and
$\mathcal{L}(e_1,\bar{\pi})$.
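
Properties 1) and 2) can be checked numerically for a given model. The Python sketch below assumes the standard hidden Markov model filter $T(\pi,y) \propto B_y P'\pi$ with normalizer $\sigma(\pi,y)$, where $B_y = \mathrm{diag}(B_{1y},\ldots,B_{Xy})$; the matrices `P` and `B` and the two beliefs are hypothetical TP2 examples, not parameters taken from the paper.

```python
import numpy as np

def hmm_filter(pi, y, P, B):
    """Return (T(pi, y), sigma(pi, y)) for observation index y."""
    unnorm = B[:, y] * (P.T @ pi)  # B_y P' pi
    sigma = unnorm.sum()           # probability of observing y given pi
    return unnorm / sigma, sigma

def mlr_geq(p1, p2, tol=1e-12):
    """p1 >=_r p2 in the sense of (78)."""
    M = np.outer(p1, p2) - np.outer(p2, p1)
    return bool(np.all(np.triu(M, k=1) <= tol))

# Hypothetical TP2 transition and observation matrices (X = 3, two observations).
P = np.array([[1.0, 0.0, 0.0], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]])
B = np.array([[0.8, 0.2], [0.2, 0.8], [0.2, 0.8]])
pi_hi = np.array([0.1, 0.3, 0.6])  # MLR larger belief
pi_lo = np.array([0.3, 0.4, 0.3])  # MLR smaller belief
```

For these values the updated beliefs keep their MLR order for each observation, and the larger observation index yields the MLR larger posterior, as the theorem predicts under (A2) and (A3).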

Part 1 and Part 2 use elementary properties of positive matrices and are proved in [41, Lemma
4.1 and 4.2].

Proof of Part 3 for $C(\pi,1)$: To give sufficient conditions for $C(\pi,1)$ to be $\ge_r$ decreasing wrt
$\pi \in \Pi(X)$, we start with the following convenient parametrization of the family of belief states
that are first-order stochastically dominated by a given belief state, see [32] for proof.

Lemma 3: (i) For any $\pi \in \Pi(X)$, all belief states $\tilde{\pi} \le_s \pi$ are of the form

$$\tilde{\pi} = \pi^{\epsilon_1,\epsilon_2,\ldots,\epsilon_{X-1}} = \pi + \epsilon_1(e_1 - e_2) + \epsilon_2(e_2 - e_3) + \cdots + \epsilon_{X-1}(e_{X-1} - e_X), \qquad (82)$$

where the variables $\epsilon_j$ satisfy $0 \le \epsilon_j \le \min\{1 - \pi(j), \pi(j+1)\}$, $j = 1,\ldots,X-1$. Moreover,
the $\epsilon$-parametrized belief state $\pi^{\epsilon_1,\epsilon_2,\ldots,\epsilon_{X-1}}$ is stochastically decreasing in the elements $\epsilon_j$. That
is, for any $j = 1,\ldots,X-1$, $\epsilon_j \le \bar{\epsilon}_j$ implies $\pi^{\epsilon_1,\ldots,\epsilon_j,\ldots,\epsilon_{X-1}} \ge_s \pi^{\epsilon_1,\ldots,\bar{\epsilon}_j,\ldots,\epsilon_{X-1}}$. ■
Remark: The above constraints on $\epsilon_j$ ensure that $\pi^{\epsilon_1,\epsilon_2,\ldots,\epsilon_{X-1}}$ is a valid belief state.

In light of the above lemma, it suffices to prove that $C(\pi^{\epsilon_1,\epsilon_2,\ldots,\epsilon_{X-1}},1)$ is increasing in $\epsilon_i$,
$i = 1,2,\ldots,X-1$. We introduce the following lemma. The proof follows straightforwardly using
$dF(\pi^{\epsilon_1,\epsilon_2,\ldots,\epsilon_{X-1}})/d\epsilon_i \ge 0$ and is omitted.

Lemma 4: Suppose $F(\pi^{\epsilon_1,\ldots,\epsilon_{X-1}}) = \phi'\pi^{\epsilon_1,\ldots,\epsilon_{X-1}} - \alpha(h'\pi^{\epsilon_1,\ldots,\epsilon_{X-1}})^2$, where $\phi, h \in \mathbb{R}^X$.
Then if $\alpha \ge 0$, a sufficient condition for $F(\pi^{\epsilon_1,\ldots,\epsilon_{X-1}})$ to be increasing wrt $\epsilon$ is

$$\phi_i - \phi_{i+1} \ge 2\alpha\, h'\pi\, (h_i - h_{i+1}) \quad \forall \pi \in \Pi(X). \qquad (83)$$

If $h_i \ge 0$ is either monotone increasing or decreasing in $i$, then a sufficient condition for (83) is

$$\phi_i - \phi_{i+1} \ge 2\alpha h_1 (h_i - h_{i+1}). \qquad (84)$$

■

• Theorem 1: Set <p = 2ae\, h = e\ in (84). This yields 2a > and 2a > 2a which always 
hold. So C(tt, 1) is > r decreasing in n for any non-negative a. 

• Theorem 2: Set = aei — af in (84). This yields / 2 > 1 and > Clearly (AS- 
Exl)(i) and (ii) are sufficient conditions for this. In particular, P TP2 and (AS-Exl)(ii) 
implies f i+1 > fa. 

• Theorem 5: Note that the variance constraint a(e' 3 7r — (e^ir) 2 ) = a((7rx + tt 2 ) — {tx\ + 7T2) 2 )- 
Accordingly, set <\> = 2a[e\ + e2), h = e\ + e 2 in (84). This yields 2a > and 2a > 2a 
which always hold for a > 0. 

. In Theorem 6, C(7r, 1) = 0, see (50), so there is nothing to prove. 
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. In Theorem 9, to show that C(ir,l) is > r decreasing in it, it suffices to show that for each 
a E A, that c' a ir is > r decreasing in it. So in (84) choose = c a , h = 0. This yields 
(Al-Ex5). 

Proof of Part 3 for C(tt, 2): Since C(ir, 2) is linear in it, to show that C(ir, 2) > r decreases 
wrt 7T, from Result l(ii) (Appendix A), it suffices that 

C{e u 2) > C(e i+l , 2), i = 1, . . . , X - 1. (85) 

Theorem 1: This yields (Al-Exl). 

Theorem 2: (85) is equivalent to f 2 > pf'P'e 2 - ^ and (AS-Exl)(ii). Note (AS-Exl)(i) is 
sufficient for f 2 > pf'P'e 2 - 

Theorem 5: (85) is equivalent to — p(a + 0) + di + (a + /3) > — p(a + /3) + d 2 + (a + (5) > 
—p(a + (3)P^ 2 which always holds for d\ > d 2 . 

Proof of Part 4: The proof is by mathematical induction on the value iteration recursion
(27). Clearly $V_0(\pi) = -\beta(1 - e_1'\pi)$ in (27) is a MLR decreasing function of $\pi$. Consider (27)
at any stage $k$. Assume that $V_k(\pi)$ is MLR decreasing in $\pi$. From Parts 1 and 2, it follows that
under (A2), (A3), the term $\sum_{y} V_k(T(\pi,y))\,\sigma(\pi,y)$ is MLR decreasing in $\pi$. From Part 3,
under (A1), $C(\pi,u)$ is MLR decreasing. Since the sum of decreasing functions is decreasing,
the result follows.

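Proof of Part 5: Here we show that general assumption (S) at the beginning of Appendix B
takes on the forms (S-Ex1) to (S-Ex5) for the various examples. We start with the following
characterization of belief states on lines $\mathcal{L}(e_X,\bar{\pi})$ and $\mathcal{L}(e_1,\bar{\pi})$ and submodularity on these lines.

Lemma 5: (i) $\pi_1 \ge_{L_X} \pi_2$ is equivalent to $\pi_i = (1-\epsilon_i)\bar{\pi} + \epsilon_i e_X$ and $\epsilon_1 \ge \epsilon_2$ for $\bar{\pi} \in \mathcal{H}_X$
where $\mathcal{H}_X$ is defined in (79). So submodularity on $\mathcal{L}(e_X,\bar{\pi})$ is equivalent to showing

$$\pi^{\epsilon} = (1-\epsilon)\bar{\pi} + \epsilon e_X \implies C(\pi^{\epsilon},2) - C(\pi^{\epsilon},1) \text{ decreasing wrt } \epsilon. \qquad (86)$$

(ii) $\pi_1 \ge_{L_1} \pi_2$ is equivalent to $\pi_i = (1-\epsilon_i)\bar{\pi} + \epsilon_i e_1$ and $\epsilon_1 \le \epsilon_2$ for $\bar{\pi} \in \mathcal{H}_1$ where $\mathcal{H}_1$ is defined
in (79). So submodularity on $\mathcal{L}(e_1,\bar{\pi})$ is equivalent to showing

$$\pi^{\epsilon} = (1-\epsilon)\bar{\pi} + \epsilon e_1 \implies C(\pi^{\epsilon},2) - C(\pi^{\epsilon},1) \text{ increasing wrt } \epsilon. \qquad (87)$$

■

The proof of Lemma 5 follows from Lemma 2 and is omitted.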
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Suppose C(tt, 2) — C(ir, 1) is of the form 0V + a(h'ir) 2 . Then from (86), (87), sufficient 
conditions for submodularity on C(e X) n) and £(ei,7f) are for 7f G "Hx and respectively, 

X - 0V + 2atin e {h x - tin) < 0, 0! - 0V + 2atin e (hi - tint) > (88) 

In particular if hi > and monotone increasing or decreasing in i, then (88) is equivalent to 

(j) X - 0'tt + 2ah x (h x - tin) < 0, 0i - V + 2ah x (h - tin) > (89) 

where 7f G "Hx and n E Hi, respectively. 

• Theorem 1: Set 0; = (d — p(a + (3))P'ei + (f3 — a)ei, h — ei'm (89). The first inequality is 
equivalent to: (i) {d- p{a + p)){P xl -P a ) < for % > 2 and (ii) (d- p(a + P))(l -P X i) > 
a - 13. Note that (i) holds if d > p(a + (3). 

The second inequality in (89) is equivalent to (d — p(a + /3))(1 — Pa) > a — (3. Since P is 
TP2, from footnote 4 in Sec.III-B it follows that (S-Exl) is sufficient for these inequalities 
to hold. 

. Theorem 2: Set fa = d - a, fa = -/3f { + p(a + /3)f'P'e i , i = 2, . . . ,X in (89). The 
first inequality yields (i) p(a + /3)f'P'e x <d-a + (3f x and (ii) p(a + /3)f'P'(e x - e { ) < 
(3(fx~fi), i = 2, . . . ,X-1. The second inequality in (89) yields q t > |(a + /3)f / P / e i + ^. 
These inequalities imply (AS-Exl)(i) and (iii). 

• Theorem 5: Recall that the variance constraint a(e' 3 n — (e^) 2 ) = a((ni + n 2 ) — (ni + n 2 ) 2 ). 
Set 0i = di + f3 — a - p(a + (3), 2 = d 2 + (3 - a - p(a + (3), f 3 = -p{a + j3)P 32 in 
(89). The first inequality is equivalent to p(a + (3)(1 — P 32 ) < (3 + di — a, i = 1, 2. The 
second inequality is equivalent to di > and di + (3 — a > p(a + (3)(1 — P?, 2 ). A sufficient 
condition for these is (S-Ex2). 

• In Theorem 6, since C(ir, 1) = 0, C(ir, 2) decreasing on lines C{e x , n) and £(ei, n) implies 
submodularity. This is implied by (Al-Ex3). 

. In Theorem 9, f = J2 y B y c y ~ Co/(l -p),h = 0in (89) yields (S-Ex5). 

Proof of Part 6: From Definition 5, to show that Q(n,u) is submodular, requires showing 
that Q(ir, 1) — Q(ir, 2) is >^ on lines C(e x , 7r) for i — 1 and X. Part 4 shows by induction that 
for each k, Vk(7i) is > r decreasing on H(X) if (Al), (A2), (A3) hold. This implies that Vk(n) 
is >Li decreasing on lines £(e x ,it) and £(ei,7f). So to prove Qk(n,u) in (25) is submodular, 
we only need to show that C(ir, 1) — C(ir, 2) is >l, decreasing on £(ej, n), i = 1, X . But this is 
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implied by (S) as shown in Part 5 above. Since submodularity is closed under pointwise limits 
[48, Lemma 2.6.1 and Corollary 2.6.1], it follows that Q(n,u) is submodular on > L ., i = 1,X 
Having established Q(ir, u) is submodular on > L ., i — 1, X, Theorem 12 in Appendix A implies 
that the optimal policy /i*(7r) is > Ll and >l x increasing on lines. 

C. Proof of Theorem 1

With the above key theorem, we can now prove Theorem 1. The statement of Theorem 1 that
$\mu^*(\pi)$ is $\ge_{L_1}$ and $\ge_{L_X}$ increasing is proved above.

Statement (i): (a) Characterization of Switching Curve $\Gamma$. For each $\bar{\pi} \in \mathcal{H}_X$ (79), construct the
line segment $\mathcal{L}(e_X,\bar{\pi})$ connecting $\bar{\pi}$ to $e_X$ as in (80). By Lemma 2 in Appendix A, on the line
segment $(1-\epsilon)\bar{\pi} + \epsilon e_X$, all belief states are MLR orderable. Since $\mu^*(\pi)$ is monotone
increasing for $\pi \in \mathcal{L}(e_X,\bar{\pi})$, moving along this line segment towards $e_X$, pick the largest $\epsilon$ for
which $\mu^*(\pi) = 1$. (Since $\mu^*(e_1) = 1$, such an $\epsilon$ always exists.) The belief state corresponding
to this $\epsilon$ is the threshold belief state. Denote it by $\Gamma(\bar{\pi}) = \pi^{\epsilon^*,\bar{\pi}} \in \mathcal{L}(e_X,\bar{\pi})$ where
$\epsilon^* = \sup\{\epsilon \in [0,1] : \mu^*(\pi^{\epsilon,\bar{\pi}}) = 1\}$.

The above construction implies that on $\mathcal{L}(e_X,\bar{\pi})$, there is a unique threshold point $\Gamma(\bar{\pi})$. Note
that the entire simplex can be covered by considering all lines $\mathcal{L}(e_X,\bar{\pi})$, for $\bar{\pi} \in \mathcal{H}_X$,
i.e., $\Pi(X) = \cup_{\bar{\pi} \in \mathcal{H}_X} \mathcal{L}(e_X,\bar{\pi})$. Combining all points $\Gamma(\bar{\pi})$ for all lines $\mathcal{L}(e_X,\bar{\pi})$, $\bar{\pi} \in \mathcal{H}_X$,
yields a unique threshold curve in $\Pi(X)$ denoted $\Gamma = \cup_{\bar{\pi} \in \mathcal{H}_X} \Gamma(\bar{\pi})$.

Statement (i): (b) Connectedness of regions $\mathcal{R}_1$ and $\mathcal{R}_2$.
Connectedness of $\mathcal{R}_1$: Since $e_1 \in \mathcal{R}_1$, call $\mathcal{R}_{1a}$ the subset of $\mathcal{R}_1$ that contains $e_1$. Suppose $\mathcal{R}_{1b}$
was a subset of $\mathcal{R}_1$ that was disconnected from $\mathcal{R}_{1a}$. Recall that every point in $\Pi(X)$ lies on
a line segment $\mathcal{L}(e_1,\bar{\pi})$ for some $\bar{\pi}$. Then such a line segment starting from $e_1 \in \mathcal{R}_{1a}$ would
leave the region $\mathcal{R}_{1a}$, pass through a region where action 2 was optimal, and then intersect the
region $\mathcal{R}_{1b}$ where action 1 is optimal. But this violates the requirement that $\mu^*(\pi)$ is increasing
on $\mathcal{L}(e_1,\bar{\pi})$. Hence $\mathcal{R}_{1a}$ and $\mathcal{R}_{1b}$ have to be connected. (Note that in the special case $\alpha = 0$,
$\mathcal{R}_1$ is convex by Theorem 3, and so is obviously connected.)

Connectedness of $\mathcal{R}_2$: Assume $e_X \in \mathcal{R}_2$, otherwise $\mathcal{R}_2 = \emptyset$ and there is nothing to prove.
Call the subset of $\mathcal{R}_2$ that contains $e_X$ $\mathcal{R}_{2a}$. Suppose $\mathcal{R}_{2b} \subset \mathcal{R}_2$ is disconnected from $\mathcal{R}_{2a}$.
Every point in $\Pi(X)$ can be joined by a line segment $\mathcal{L}(e_X,\bar{\pi})$ to $e_X$. Then such a line
segment starting from $e_X \in \mathcal{R}_{2a}$ would leave the region $\mathcal{R}_{2a}$, pass through a region where action
1 was optimal, and then intersect the region $\mathcal{R}_{2b}$ (where action 2 is optimal). But this violates
the requirement that $\mu^*(\pi)$ is increasing on $\mathcal{L}(e_X,\bar{\pi})$. Hence $\mathcal{R}_{2a}$ and $\mathcal{R}_{2b}$ have to be connected.

Statement (ii): Suppose $e_i \in \mathcal{R}_1$. Then considering lines $\mathcal{L}(e_i,\bar{\pi})$ and ordering $\ge_{L_i}$, it follows
that $e_{i-1} \in \mathcal{R}_1$. Similarly if $e_i \in \mathcal{R}_2$, then considering lines $\mathcal{L}(e_{i+1},\bar{\pi})$ and ordering $\ge_{L_{i+1}}$, it
follows that $e_{i+1} \in \mathcal{R}_2$.

Statement (iii) follows trivially since for $X = 2$, $\Pi(X)$ is a one dimensional simplex.

D. Proof of Theorem 3 

We first prove in the following lemma that V(tt) is concave in n. 

Lemma 6: V(ir) in (25) is concave in tt G n(X). 
Proof of Lemma 6: Our proof constructs an outer approximation V fc (7r) (defined below) to Vk(ir) 
and comprises of two steps: Step 1: (7r) is concave; Step 2: (7r) — >■ 14 (7r) uniformly as 
n — > oo. This establishes that Vfc(7r) is concave, and therefore V(ir) is concave. 

Consider n arbitrary but distinct belief states tt 1 , . . . , n n G II (X). Let 7*(vr) denote the gradient 
vector of Q(n, 1) in (27) at 7T = 7r\ z = 1, . . . , n. That is, Q(7T l , 1) = 7r t 7*(7r l ). Now construct 
a piecewise linear function out of these gradient vectors as Q^(tt, 1) = minj7 l 7r. It is easily 
seen from (24) that Q(ir, 1) is concave. Therefore Q^(ir, 1) is piecewise linear and concave in 
7r since a piecewise linear function composed of tangents to a concave function is concave. 

Construct the following auxiliary value function V^ n '(n) via value iteration similar to (27): 

V#1(tt) = min{QW(7r, lJ.Qj&fa 2)}, /4 +1 (tt) = argmin{QM(7r, 1), ^(tt, 2)} 
where Q^M) = C(tt, 2) + p £ (T(n,y)) a(n,y), Q^(n, 1) = nunfV tt G II(X). 

(90) 

Step 1: Proof of concavity of 14 (7r): We prove this by induction on the value iteration 
algorithm (90) for k = 1, 2, . . . and fixed n. Start with arbitrary concave V^(7r). As mentioned 
below (27), the VI algorithm converges for any choice of initialization. 

Since both C(ir, 2) (see (24)) and Q^ n \ir, 1) are piecewise linear in tt, it is easily seen from (90) 
that at each iteration k, (7r) is positively homogeneous, i.e., V;! n ^(A7r) = XV^ n \iT) for any 
A > 0. As a result (90) yields, Q^tt, 2) = C(tt, 2)+p E<,eY v l n \ B y^)- Now use mathematical 
induction. Assume (tt) is concave. Since B y ir is concave, and the composition of concave 
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functions is concave, V^ n \B y ir) is concave. Since C(it,2) is piecewise linear and concave, 



and the sum of concave functions is concave, it follows that qIt^tt, 2) is concave. Finally 



since minimization preserves concavity, it follows that V^V 1 {'k) = min{Q^ (tt , 1) , Q^^n , 2)} 
is concave. This completes the inductive proof. 

Step 2: Concavity of Vk(ir): Next, we show that V k [n \ix) -> V k (n) uniformly in U(X) 
implying that V k (ir) is concave. Since Q(ir, 1) is concave, it follows that Q^(7r, 1) > Q(tt, 1) and 
also Q( n )(7r, 1) is a monotone sequence of decreasing functions in n. (Intuitively, the piecewise 
linear function composed of tangents always upper bound a concave function and they become 
tighter as more piecewise linear segments are considered). From (25) and (90) this implies that 
V^ n \ir) > Vk(ir) for all k and also V^ n+1) (7r) < V^ n \n). Finally, since (i) (7r) converges 
to Vfc(7r) pointwise, (ii) H(X) is compact, (iii) v£ (n) is continuous and V k (n) is monotone 
decreasing sequence in n, it follows from [42, Theorem 7.13] that V^(7r) — > Vk(ir) uniformly 
on H(X). Therefore, Vfc(7r) is concave for k = 1, 2, . . .. Finally as discussed below (27), Vk(n) 
converges uniformly to V(ir); so V(ir) is concave. Thus Lemma 6 is proved. ■ 

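The rest of the proof of Theorem 3 follows from arguments in [28]. We repeat this for
completeness here. Our goal is to show that $\mathcal{R}_1$ is convex. Pick any two belief states $\pi_1, \pi_2 \in \mathcal{R}_1$.
To demonstrate convexity of $\mathcal{R}_1$, we need to show for any $\lambda \in [0,1]$, $\lambda\pi_1 + (1-\lambda)\pi_2 \in \mathcal{R}_1$.
Since $V(\pi)$ is concave and $\alpha = 0$,

$$\begin{aligned}
V(\lambda\pi_1 + (1-\lambda)\pi_2) &\ge \lambda V(\pi_1) + (1-\lambda)V(\pi_2) \\
&= \lambda Q(\pi_1,1) + (1-\lambda)Q(\pi_2,1) \quad (\text{since } \pi_1, \pi_2 \in \mathcal{R}_1) \\
&= Q(\lambda\pi_1 + (1-\lambda)\pi_2, 1) \quad (\text{since } Q(\pi,1) \text{ is linear in } \pi) \\
&\ge V(\lambda\pi_1 + (1-\lambda)\pi_2) \quad (\text{since } V(\pi) \text{ is the optimal value function}). \qquad (91)
\end{aligned}$$

Thus all the inequalities above are equalities, and $\lambda\pi_1 + (1-\lambda)\pi_2 \in \mathcal{R}_1$.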

E. Proof of Theorem 4 

Given any 7r 1 ,7r 2 e C(e x ,7t) with 7r 2 > Lx 7Ti, we need to prove: fie(^i) < ^0(^2) iff 
8(X - 2) > 1, 0(i) < 8(X - 2) for i < X - 2. But from the structure of (32), obvi- 
ously /Ue(7Ti) < /Ue(7r 2 ) is equivalent to 





/ 




< 




/ 


7T 2 




1 0' 






1 0' 






, or 






-1 








-1 





, or equivalently, 



1 



0(X - 2) 



(TTl - 7T 2 ) < 0. 
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Now from Lemma 2(iii), ir 2 >l x tti implies that 7Ti = eie x + (1 — £i)7T, 7r 2 = e 2 e x + (1 — £2)^ 
and ei < e 2 . Substituting these into the above expression, we need to prove 

(ei -e 2 )(9(X-2) - [0 1 6(1) ■■■ 9{X - 2)1 ' f) < 0, W e H x 

iff 0(X - 2) > 1, 0(i) < 0(X - 2), i < X - 2. This is obviously true. 

A similar proof shows that on lines £(ei,7f) the linear threshold policy satisfies 110(1x1) < 
/i (7r 2 ) iff 0(z) > for i < X - 2. 

F. Proof of Theorem 6 

The only difference compared to the meta-theorem is the update of the belief state (49),
which now includes the term $\mathrm{diag}(R_u)$. The elements of $R_u$ are non-negative and functionally
independent of the observation $y$. Therefore the three main requirements, namely that $T(\pi, y)$ is MLR
increasing in $\pi$, $T(\pi, y)$ is MLR increasing in $y$, and $\sigma(\pi, \cdot)$ is $\geq_s$ increasing in $\pi$, continue to
hold. The rest of the proof is then identical to that of Theorem 1.



G. Proof of Theorem 8 

The proof is more complex than that of Theorem 3 since now $V(\pi)$ is not necessarily
concave over $\Pi(X)$, because $T(\cdot)$ and $\sigma(\cdot)$ are functions of $R^\pi$ (56), which itself is an explicit (and
in general non-concave) function of $\pi$.

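Define the matrix $R^\pi = (R^\pi(i,a),\ i \in \{1,2\},\ a \in \{1,2\})$, where $R^\pi(i,a) = P(a \mid x = e_i, \pi)$.
It can be verified from (56) that there are only 3 possible values for $R^\pi$, namely,

$$
R^\pi = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix},\ \pi \in \mathcal{P}_1, \qquad
R^\pi = B,\ \pi \in \mathcal{P}_2 \cup \mathcal{P}_3, \qquad
R^\pi = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix},\ \pi \in \mathcal{P}_4.
\tag{92}
$$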



Thus based on the dynamic programming equation (61), the value iteration algorithm reads

$$
V_{n+1}(\pi) = \min\Big\{ C(\pi,2) + \rho V_n(\pi) I(\pi \in \mathcal{P}_1) + \rho \sum_{a} V_n(T(\pi,a))\,\sigma(\pi,a)\,\big[I(\pi \in \mathcal{P}_2) + I(\pi \in \mathcal{P}_3)\big] + \rho V_n(\pi) I(\pi \in \mathcal{P}_4),\ 0 \Big\}
\tag{93}
$$

Assuming $V_n(\pi)$ is MLR decreasing on $\mathcal{P}_1 \cup \mathcal{P}_4$ straightforwardly implies that $V_{n+1}(\pi)$ is MLR
decreasing on $\mathcal{P}_1 \cup \mathcal{P}_4$, since $C(\pi, 2)$ is MLR decreasing. This proves claim (i).

We now prove inductively that $V_n(\pi)$ is piecewise linear and concave on each interval $\mathcal{P}_l$, $l =
1, \ldots, 4$. The proof of concavity on $\mathcal{P}_1$ and $\mathcal{P}_4$ follows straightforwardly since $C(\pi, 2)$ is
piecewise linear and concave. The proof for intervals $\mathcal{P}_2$ and $\mathcal{P}_3$ is more delicate.

We need the following property of the social learning Bayesian filter. With a slight abuse of
notation, define the two-dimensional vector $\eta_i = (1 - \eta_i, \eta_i)'$.

Lemma 7: Consider the social learning Bayesian filter (55). Then $T(\eta_1, 1) = \eta_2$, $T(\eta_3, 2) = \eta_2$.
Furthermore, if $B$ is symmetric TP2, then $T(\eta_2, 2) = \eta_1$, $T(\eta_2, 1) = \eta_3$ and $\eta_3 \leq \eta_2 \leq \eta_1$. So

(i) $\pi \in \mathcal{P}_2$ implies $T(\pi, 2) \in \mathcal{P}_1$ and $T(\pi, 1) \in \mathcal{P}_3$.

(ii) $\pi \in \mathcal{P}_3$ implies $T(\pi, 2) \in \mathcal{P}_2$ and $T(\pi, 1) \in \mathcal{P}_4$.

The proof of Lemma 7 is as follows. Recall from (92) that on intervals $\mathcal{P}_2$ and $\mathcal{P}_3$, $R^\pi = B$.
Then it is straightforwardly verified from (55) that $T(\eta_1, 1) = T(\eta_3, 2) = \eta_2$. Next, using (55) it
follows that $B_{12} B_{11} = B_{22} B_{21}$ is a sufficient condition for $T(\eta_2, 2) = \eta_1$ and $T(\eta_2, 1) = \eta_3$. Also,
applying Theorem 13(2), $B$ TP2 implies $\eta_3 \leq \eta_2 \leq \eta_1$. So $B$ symmetric TP2 is sufficient for the
claims of the lemma to hold. Statements (i) and (ii) then follow straightforwardly. In particular,
from Theorem 13(1), $\eta_1 \geq_r \pi \geq_r \eta_2$ implies $T(\eta_1, 1) = \eta_2 \geq_r T(\pi, 1) \geq_r T(\eta_2, 1) = \eta_3$, which
implies Statement (i) of the lemma. Statement (ii) follows similarly.

Returning to the proof of Theorem 8, assume now that $V_n(\pi)$ is piecewise linear and concave
on each interval $\mathcal{P}_l$, $l = 1, \ldots, 4$. That is, for two-dimensional vectors $\gamma_l$ in the set $\Gamma_l$,

$$
V_n(\pi) = \min_{\gamma_l \in \Gamma_l} \gamma_l' \pi, \quad \pi \in \mathcal{P}_l, \quad l = 1, \ldots, 4.
$$

Consider $\pi \in \mathcal{P}_2$. From (92), since $R^\pi_a = B_a$, $a = 1, 2$, Lemma 7(i) together with the value
iteration algorithm (93) yields

$$
V_{n+1}(\pi) = \min\Big\{ C(\pi,2) + \rho \Big[ \min_{\gamma_3 \in \Gamma_3} \gamma_3' B_1 \pi + \min_{\gamma_1 \in \Gamma_1} \gamma_1' B_2 \pi \Big],\ 0 \Big\}.
$$

Since each of the terms in the above equation is piecewise linear and concave, it follows that
$V_{n+1}(\pi)$ is piecewise linear and concave on $\mathcal{P}_2$. A similar proof holds for $\mathcal{P}_3$, and this involves
using Lemma 7(ii). As a result the stopping set on each interval $\mathcal{P}_l$ is a convex region, i.e., an
interval. This proves claim (ii).

H. Proof of Theorem 9

Part (i) follows directly from the proof of Theorems 13 and 1. For Part (ii), define the convex
polytopes $\mathcal{P}_a = \{\pi : c_a'\pi < c_{\bar{a}}'\pi,\ \bar{a} \neq a\}$. Then on each convex polytope $\mathcal{P}_a$, since $C(\pi, 1) =
c_a'\pi$ (recall $\alpha = 0$), we can apply the argument of (91), which yields that $\mathcal{R}_1 \cap \mathcal{P}_a$ is a convex
region. Thus $\mathcal{R}_1$ is the union of $A$ convex regions and is in general non-convex. However, it is
still a connected set by part (i) of the theorem.

I. Proof of Theorem 10

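A proof similar to that of Theorem 3 establishes that $V(\pi)$ is concave on $\Pi(X)$. We then use the
Blackwell dominance condition (71). In particular,

$$
T(\pi, y^{(1)}) = \sum_{y^{(2)} \in Y^{(2)}} T(\pi, y^{(2)})\, \frac{\sigma(\pi, y^{(2)})}{\sigma(\pi, y^{(1)})}\, P(y^{(1)} \mid y^{(2)})
\quad \text{and} \quad
\sigma(\pi, y^{(1)}) = \sum_{y^{(2)} \in Y^{(2)}} \sigma(\pi, y^{(2)})\, P(y^{(1)} \mid y^{(2)}).
$$

Therefore $\frac{\sigma(\pi, y^{(2)})}{\sigma(\pi, y^{(1)})} P(y^{(1)} \mid y^{(2)})$ is a probability measure with respect to $y^{(2)}$. Since $V(\cdot)$ is concave, using
Jensen's inequality it follows that

$$
V(T(\pi, y^{(1)})) \geq \sum_{y^{(2)}} V(T(\pi, y^{(2)}))\, \frac{\sigma(\pi, y^{(2)})}{\sigma(\pi, y^{(1)})}\, P(y^{(1)} \mid y^{(2)}),
\quad \text{implying} \quad
\sum_{y^{(1)}} V(T(\pi, y^{(1)}))\, \sigma(\pi, y^{(1)}) \geq \sum_{y^{(2)}} V(T(\pi, y^{(2)}))\, \sigma(\pi, y^{(2)}).
$$

Therefore for $\pi \in \Pi^s$,

$$
C(\pi, 2) + \sum_{y^{(2)}} V(T(\pi, y^{(2)}))\, \sigma(\pi, y^{(2)}) \leq C(\pi, 1) + \sum_{y^{(1)}} V(T(\pi, y^{(1)}))\, \sigma(\pi, y^{(1)}).
$$

So for $\pi \in \Pi^s$, the optimal policy is $\mu^*(\pi) = \arg\min_{u \in \mathcal{U}} Q(\pi, u) = 2$. So $\underline{\mu}(\pi) = \mu^*(\pi)$ for
$\pi \in \Pi^s$ and $\underline{\mu}(\pi) = 1$ otherwise, implying that $\underline{\mu}(\pi)$ is a lower bound for $\mu^*(\pi)$.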
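The Jensen step above can be sanity-checked numerically. The following is a minimal sketch under stated assumptions rather than the paper's setup: it uses a standard hidden Markov filter $T(\pi, y) \propto B_y P'\pi$ with normalization $\sigma(\pi, y)$, a hypothetical two-state transition matrix `P`, a hypothetical observation matrix `B2`, a hypothetical stochastic matrix `R` playing the role of $P(y^{(1)} \mid y^{(2)})$, and a concave stand-in for $V$ given by a minimum of linear functions. The noisier observation matrix is formed as `B1 = B2 R`.

```python
def predict(pi, P):
    """One-step prediction P' pi."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def expected_future_value(pi, P, B, V):
    """sum_y sigma(pi, y) * V(T(pi, y)) for observation matrix B[x][y]."""
    pred = predict(pi, P)
    total = 0.0
    for y in range(len(B[0])):
        unnorm = [B[x][y] * pred[x] for x in range(len(pred))]
        sigma = sum(unnorm)              # sigma(pi, y)
        if sigma > 0.0:
            total += sigma * V([u / sigma for u in unnorm])
    return total

def concave_value(pi):
    """Hypothetical concave value function: minimum of linear functions of pi."""
    gammas = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.6]]
    return min(sum(g * p for g, p in zip(gamma, pi)) for gamma in gammas)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

# Illustrative numbers only (assumptions, not from the paper).
P = [[0.9, 0.1], [0.0, 1.0]]
B2 = [[0.8, 0.2], [0.3, 0.7]]    # more informative observations y^(2)
R = [[0.7, 0.3], [0.4, 0.6]]     # R[y2][y1] = P(y^(1) | y^(2))
B1 = matmul(B2, R)               # Blackwell-dominated observations y^(1)

if __name__ == "__main__":
    for k in range(0, 101, 25):
        pi = [1.0 - k / 100, k / 100]
        print(pi, expected_future_value(pi, P, B1, concave_value),
              expected_future_value(pi, P, B2, concave_value))
```

For every belief on the grid, the expected future value under the noisier observations `B1` is at least that under `B2`, which is the inequality obtained from Jensen's inequality in the proof.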

J. Proof of Theorem 11

Identical to the proof of meta Theorem 13 in Appendix B, under the assumptions $c(e_i, u)$
decreasing in $i$, (A2), (A3), it follows that $V(\pi; P^{(1)})$ and $V(\pi; P^{(2)})$ are MLR decreasing for
$\pi \in \Pi(X)$. We next introduce the following lemma.

Lemma 8: [18, Theorem 2.4] $P^{(1)} \succeq P^{(2)}$ implies $P^{(1)\prime}\pi \geq_r P^{(2)\prime}\pi$, where $\succeq$ is defined in
(75).

The proof of the lemma is as follows: By definition, $P^{(1)\prime}\pi \geq_r P^{(2)\prime}\pi$ is equivalent to

$$
\sum_{m} \sum_{n} \left( P^{(1)}_{mi} P^{(2)}_{nj} - P^{(2)}_{ni} P^{(1)}_{mj} \right) \pi_m \pi_n \leq 0, \quad i < j.
$$

Thus clearly (75) is a sufficient condition for $P^{(1)\prime}\pi \geq_r P^{(2)\prime}\pi$.
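The lemma can be illustrated with a short numerical check. This is a sketch under an explicit assumption: since (75) is not restated here, the code takes the sufficient condition to be the elementwise requirement that every summand of the double sum is non-positive, i.e. $P^{(1)}_{mi} P^{(2)}_{nj} \leq P^{(2)}_{ni} P^{(1)}_{mj}$ for all $m, n$ and $i < j$. The matrices `P1` and `P2` and the function names are hypothetical.

```python
def mlr_geq(p, q, tol=1e-12):
    """True if p >=_r q in the MLR order: p[i]*q[j] <= q[i]*p[j] for all i < j."""
    n = len(p)
    return all(p[i] * q[j] <= q[i] * p[j] + tol
               for i in range(n) for j in range(i + 1, n))

def elementwise_condition(P1, P2, tol=1e-12):
    """Assumed sufficient condition: each summand of the double sum is <= 0."""
    X = len(P1)
    return all(P1[m][i] * P2[n][j] <= P2[n][i] * P1[m][j] + tol
               for m in range(X) for n in range(X)
               for i in range(X) for j in range(i + 1, X))

def predict(pi, P):
    """One-step predicted belief P' pi."""
    X = len(pi)
    return [sum(pi[m] * P[m][i] for m in range(X)) for i in range(X)]

# Illustrative transition matrices (assumptions, not from the paper).
P1 = [[0.2, 0.8], [0.1, 0.9]]
P2 = [[0.6, 0.4], [0.5, 0.5]]

if __name__ == "__main__":
    print("condition holds:", elementwise_condition(P1, P2))
    ok = all(mlr_geq(predict([1 - k / 50, k / 50], P1),
                     predict([1 - k / 50, k / 50], P2)) for k in range(51))
    print("P1' pi >=_r P2' pi on the grid:", ok)
```

When the elementwise condition holds, the predicted belief under `P1` MLR dominates that under `P2` for every belief on the grid, as the double-sum argument implies.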

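Returning to the proof of the theorem, if (A2), (A3) hold, it follows from Lemma 8 and meta
Theorem 13 (Statements (1) and (2)) that for actions $u \in \{1, 2\}$,

$$
T(\pi, y, u; P^{(1)}) \geq_r T(\pi, y, u; P^{(2)}), \qquad \sigma(\pi, \cdot, u; P^{(1)}) \geq_s \sigma(\pi, \cdot, u; P^{(2)}).
\tag{94}
$$

The rest of the proof is by induction on the value iteration algorithm (74). Assume $V_k(\pi; P^{(1)}) \leq
V_k(\pi; P^{(2)})$ for $\pi \in \Pi(X)$. Then from (94), and since $V_k(\cdot\,; P^{(2)})$ is MLR decreasing,

$$
V_k(T(\pi, y, u; P^{(1)}); P^{(1)}) \leq V_k(T(\pi, y, u; P^{(1)}); P^{(2)}) \leq V_k(T(\pi, y, u; P^{(2)}); P^{(2)}).
$$

Therefore, using the first-order dominance of $\sigma$ in (94) and the fact that
$V_k(T(\pi, y, u; P^{(2)}); P^{(2)})$ is decreasing in $y$,

$$
\sum_{y} V_k(T(\pi, y, u; P^{(1)}); P^{(1)})\, \sigma(\pi, y, u; P^{(1)}) \leq \sum_{y} V_k(T(\pi, y, u; P^{(2)}); P^{(2)})\, \sigma(\pi, y, u; P^{(2)}).
$$

Next, since $C(\pi, u; P^{(1)}) \leq C(\pi, u; P^{(2)})$, it follows that

$$
C(\pi, u; P^{(1)}) + \sum_{y} V_k(T(\pi, y, u; P^{(1)}); P^{(1)})\, \sigma(\pi, y, u; P^{(1)})
\leq C(\pi, u; P^{(2)}) + \sum_{y} V_k(T(\pi, y, u; P^{(2)}); P^{(2)})\, \sigma(\pi, y, u; P^{(2)}).
$$

Taking the minimum with respect to $u$ yields $V_{k+1}(\pi; P^{(1)}) \leq V_{k+1}(\pi; P^{(2)})$.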

To prove the second claim of the theorem, first recall from (23) that $\bar{V}(\pi; P)$ is the actual
optimal expected cost. Recall that the transformation from $\bar{V}(\pi; P)$ to $V(\pi; P)$ was made to prove
that the optimal policy is monotone. It is readily verified that the quickest detection problem
(29) satisfies all the assumptions of the theorem. So $V(\pi; P^{(1)}) \leq V(\pi; P^{(2)})$. The actual optimal
cost is $\bar{V}(\pi; P) = V(\pi; P) + (\alpha + \beta)(1 - e_1'\pi)$ (see (29)). Since $(\alpha + \beta)(1 - e_1'\pi)$ is functionally
independent of $P$, it then follows that $\bar{V}(\pi; P^{(1)}) \leq \bar{V}(\pi; P^{(2)})$.